Plant chimeric binding polypeptides for universal molecular recognition

ABSTRACT

Described are libraries of nucleic acids encoding chimeric binding polypeptides based on plant scaffold polypeptide sequences. Also described are methods for generating the libraries.

This application is a divisional of co-pending U.S. application Ser. No. 13/093,518, filed Apr. 25, 2011, which is a divisional of U.S. application Ser. No. 11/706,847, filed Feb. 13, 2007, now U.S. Pat. No. 7,951,753, which claims the benefit of U.S. provisional application Ser. No. 60/773,086, filed Feb. 13, 2006. The entire contents of each of these priority applications are considered part of the present application and are hereby incorporated in the present application in their entirety.

BACKGROUND

The binding specificity and affinity of a protein for a target are determined primarily by the protein's amino acid sequence within one or more binding regions. Accordingly, varying the amino acid sequence of the relevant regions reconfigures the protein's binding properties.

In nature, combinatorial changes in protein binding are best illustrated by the vast array of immunoglobulins produced by the immune system. Each immunoglobulin includes a set of short, virtually unique, amino acid sequences known as hypervariable regions (i.e., protein binding domains), and another set of longer, invariant sequences known as constant regions. The constant regions form β sheets that stabilize the three dimensional structure of the protein in spite of the enormous sequence diversity among hypervariable regions in the population of immunoglobulins. Each set of hypervariable regions confers binding specificity and affinity. The assembly of two heavy chain and two light chain immunoglobulins into a large protein complex (i.e., an antibody) further increases the number of combinations with diverse binding activities.

The binding diversity of antibodies has been successfully exploited in many biomedical and industrial applications. For example, libraries have been constructed that express immunoglobulins bearing artificially diversified hypervariable regions. Immunoglobulin expression libraries are very useful for identifying high affinity antibodies to a target molecule (e.g., a receptor or receptor ligand). A nucleic acid encoding the identified immunoglobulin can then be isolated and expressed in host cells or organisms.

However, despite the usefulness of immunoglobulins and antibodies in general, their expression in transgenic plants can be problematic. Immunoglobulins may not fold properly in plant cytoplasm because they require the formation of multiple disulfide bonds. Further, the large size of immunoglobulins prevents their effective uptake by some plant pests. Thus, immunoglobulins are frequently not useful as protein pesticides or pesticide targeting molecules. Finally, expressing mammalian proteins such as immunoglobulins (e.g., as so called “plantibodies”) in edible plants also raises potential issues of consumer acceptance and is thus an impediment to commercialization. This may effectively prevent use of plantibodies for many input and output traits in transgenic plants.

The above-mentioned disadvantages of immunoglobulins can be circumvented by generating diverse libraries of binding proteins from other classes of structurally tolerant proteins, preferably plant-derived proteins. These libraries can be screened to identify individual proteins that bind with desired specificity and affinity to a target of interest. Afterwards, identified binding proteins can be efficiently expressed in transgenic plants.

SUMMARY

Diverse libraries of nucleic acids encoding plant chimeric binding polypeptides, as well as methods for generating them, are described herein. The chimeric binding polypeptides are conceptually analogous to immunoglobulins in that they feature highly varied binding domains in the framework of unvarying sequences that encode a structurally robust protein. However, the chimeric binding polypeptides described herein have the considerable advantage of being derived from plant protein sequences, thereby avoiding many of the problems associated with immunoglobulin expression in plants. The amino acid sequences of the encoded plant chimeric binding proteins are derived from a scaffold polypeptide sequence that includes subsequences to be varied. The varied subsequences correspond to putative binding domains of the plant chimeric binding polypeptides, and are highly heterogeneous in the library of encoded plant chimeric binding proteins. In contrast, the sequence of the encoded chimeric binding proteins outside of the varied subsequences is essentially the same as the parent scaffold polypeptide sequence and highly homogeneous throughout the library of encoded plant chimeric binding proteins. Such libraries can serve as a universal molecular recognition platform to select proteins with high selectivity and affinity binding for expression in transgenic plants.

Accordingly, one aspect described herein is a library of nucleic acid molecules encoding at least ten (e.g., at least 1,000, 10⁵, or 10⁶) different chimeric binding polypeptides. The amino acid sequence of each polypeptide includes C₁-X₁-C₂-X₂-C₃-X₃-C₄, where C₁-C₄ are backbone subsequences selected from purple acid phosphatase (i.e., SEQ ID NOs: 1-30, 31-60, 61-90, and 91-120, respectively) that can include up to 30 (e.g., 20, 10, or 5) single amino acid substitutions, deletions, insertions, or additions to the selected purple acid phosphatase sequences. The C₁-C₄ subsequences are homogeneous across many of the polypeptides encoded in the library. In contrast to the C₁-C₄ backbone subsequences, the X₁-X₃ subsequences are independent variable subsequences consisting of 2-20 amino acids, and these subsequences are heterogeneous across many of the polypeptides in the library. For example, the library of chimeric polypeptides can have the amino acid sequence of any one of SEQ ID NOs:124-126 including one to ten single amino acid substitutions, deletions, insertions, or additions to amino acid positions corresponding to 23-39, 51-49, and 79-84 of SEQ ID NOs:124-126.
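The C₁-X₁-C₂-X₂-C₃-X₃-C₄ architecture described above can be illustrated with a minimal in silico sketch. The backbone strings below are hypothetical placeholders (they are not the sequences of SEQ ID NOs:1-120), and the function names are assumptions introduced only for illustration; the sketch simply shows a library whose members share one backbone while their three variable subsequences differ.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 naturally occurring amino acids

# Hypothetical placeholder backbones; NOT the sequences of SEQ ID NOs:1-120.
BACKBONE = ("MKTLLV", "GDLGQT", "WNSTAE", "HVLPYR")


def random_variable_region(rng, min_len=2, max_len=20):
    """Return a random variable subsequence X of 2-20 amino acids."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def assemble_chimera(backbone, variable_regions):
    """Interleave C1-C4 with X1-X3 to give C1-X1-C2-X2-C3-X3-C4."""
    c1, c2, c3, c4 = backbone
    x1, x2, x3 = variable_regions
    return c1 + x1 + c2 + x2 + c3 + x3 + c4


def build_library(n_members, seed=0):
    """Build n_members distinct chimeras that all share one backbone."""
    rng = random.Random(seed)
    library = set()
    while len(library) < n_members:
        regions = tuple(random_variable_region(rng) for _ in range(3))
        library.add(assemble_chimera(BACKBONE, regions))
    return sorted(library)


# At least ten different encoded polypeptides, as in the aspect above.
library = build_library(10)
```

In this sketch the backbone is perfectly homogeneous; the description also permits a limited number of changes within C₁-C₄, which is not modeled here.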

Another aspect described herein is a method for generating the just-described library. The method includes providing a parental nucleic acid encoding a plant scaffold polypeptide sequence containing C₁-X₁-C₂-X₂-C₃-X₃-C₄ as defined above. The method further includes replicating the parental nucleic acid (e.g., at least one of the X₁-X₃ subsequences is selected from SEQ ID NOs: 121-123) under conditions that introduce up to 10 single amino acid substitutions, deletions, insertions, or additions to the parental X₁, X₂, or X₃ subsequences, whereby a heterogeneous population of randomly varied subsequences encoding X₁, X₂, or X₃ is generated. The population of varied subsequences is then substituted into a population of parental nucleic acids at the positions corresponding to those encoding X₁, X₂, or X₃. The amino acid substitutions, deletions, insertions, or additions can be introduced into the parental nucleic acid subsequences by replication in vitro (e.g., using a purified mutagenic polymerase or nucleotide analogs) or in vivo (e.g., in a mutagenic strain of E. coli). The just-described library can be introduced into a biological replication system (e.g., E. coli or bacteriophage) and amplified.
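The outcome of this method can be sketched at the amino acid level as follows. This is a simplified model under stated assumptions, not the replication chemistry itself: it introduces substitutions only (no deletions, insertions, or additions), it acts directly on protein sequences rather than on the encoding nucleic acid, and the parental backbone and variable regions are hypothetical placeholders.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical parental scaffold; NOT taken from the sequence listing.
PARENT_BACKBONE = ("MKTLLV", "GDLGQT", "WNSTAE", "HVLPYR")
PARENT_REGIONS = ("ASDFGHKLQWER", "TYPNMVCA", "GSGSGS")


def mutate_region(region, max_changes, rng):
    """Introduce 1..max_changes substitutions at distinct positions of one X region."""
    residues = list(region)
    n_changes = rng.randint(1, min(max_changes, len(residues)))
    for pos in rng.sample(range(len(residues)), n_changes):
        alternatives = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(alternatives)
    return "".join(residues)


def vary_parent(backbone, parent_regions, n_variants, max_changes=10, seed=0):
    """Vary X1-X3 and substitute them back between the unchanged C1-C4."""
    rng = random.Random(seed)
    c1, c2, c3, c4 = backbone
    variants = []
    for _ in range(n_variants):
        x1, x2, x3 = (mutate_region(x, max_changes, rng) for x in parent_regions)
        variants.append(c1 + x1 + c2 + x2 + c3 + x3 + c4)
    return variants


variants = vary_parent(PARENT_BACKBONE, PARENT_REGIONS, n_variants=20)
```

Because the backbone strings are reused unchanged, every variant differs from the parent only within the positions that correspond to X₁, X₂, or X₃.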

A related aspect described herein is another method for generating the above-described library of nucleic acids. The method includes selecting an amino acid sequence containing C₁-X₁-C₂-X₂-C₃-X₃-C₄ as defined above. The method further includes providing a first and second set of oligonucleotides having overlapping complementary sequences. Oligonucleotides of the first set encode the C₁-C₄ subsequences and multiple heterogeneous X₁-X₃ subsequences. Oligonucleotides of the second set are complementary to nucleotide sequences encoding the C₁-C₄ subsequences and multiple heterogeneous X₁-X₃ subsequences. The two sets of oligonucleotides are combined to form a first mixture and incubated under conditions that allow hybridization of the overlapping complementary sequences. The resulting hybridized sequences are then extended to form a second mixture containing the above-described library.
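The hybridize-and-extend step can be sketched for a single pair of overlapping oligonucleotides. This is an idealized illustration: the oligonucleotide sequences are hypothetical, hybridization is modeled as an exact match between the 3′ end of a first-set oligonucleotide and the reverse complement of a second-set oligonucleotide, and `extend_pair` is a name introduced here for illustration only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def extend_pair(forward, reverse, min_overlap=6):
    """Hybridize two oligos through their overlap and extend to full length.

    Both oligos are written 5' to 3'. Returns the top-strand sequence of the
    extended duplex, or None if no overlap of at least min_overlap exists.
    """
    reverse_as_top = reverse_complement(reverse)
    max_len = min(len(forward), len(reverse_as_top))
    for size in range(max_len, min_overlap - 1, -1):
        if forward[-size:] == reverse_as_top[:size]:
            return forward + reverse_as_top[size:]
    return None


# First set: one variable codon (Ala, Asp, Ser, or Tyr) between constant flanks.
forward_oligos = ["ATGGCT" + codon + "GCAAGC" for codon in ("GCT", "GAT", "TCT", "TAT")]
# Second set: a single oligo complementary to the downstream constant region.
reverse_oligo = reverse_complement("GCAAGCTTTGGA")

assembled = [extend_pair(fwd, reverse_oligo) for fwd in forward_oligos]
```

Each assembled product keeps the constant flanks and differs only at the variable codon, which mirrors how heterogeneous first-set oligonucleotides give rise to a heterogeneous library.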

Yet another aspect of the invention is a library of nucleic acids encoding chimeric binding polypeptides, each of which includes an amino acid sequence at least 70% (i.e., any percentage between 70% and 100%) identical to any of SEQ ID NOs: 127-129. The amino acid sequence of each of the encoded polypeptides includes amino acids that differ from those of SEQ ID NOs: 127-129 at positions 14, 15, 33, 35-36, 38, 47-48, 66, 68-69, 71, 80, 81, 99, 101-102, and 104, and the amino acid differences are heterogeneous across a plurality of the encoded polypeptides. The amino acid sequence of each of the encoded polypeptides outside of the above-listed positions is homogeneous across a plurality of the encoded chimeric polypeptides.
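The listed positions and the identity threshold can be checked with a short sketch. The position list is taken from the paragraph above; everything else is an assumption made for illustration, including the placeholder reference sequence (it is not SEQ ID NO: 127, 128, or 129) and the use of a simple ungapped, position-by-position identity measure rather than an alignment-based one.

```python
def parse_positions(spec):
    """Expand a list such as '14, 15, 35-36' into sorted 1-based positions."""
    positions = set()
    for item in spec.split(","):
        item = item.strip()
        if "-" in item:
            start, end = (int(part) for part in item.split("-"))
            positions.update(range(start, end + 1))
        else:
            positions.add(int(item))
    return sorted(positions)


VARIABLE_POSITIONS = parse_positions(
    "14, 15, 33, 35-36, 38, 47-48, 66, 68-69, 71, 80, 81, 99, 101-102, 104"
)


def percent_identity(a, b):
    """Ungapped percent identity between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def differs_only_at(reference, candidate, allowed_positions):
    """True if candidate matches reference everywhere outside allowed_positions."""
    allowed = set(allowed_positions)
    return len(reference) == len(candidate) and all(
        r == c or i in allowed
        for i, (r, c) in enumerate(zip(reference, candidate), start=1)
    )


# Hypothetical 110-residue placeholder reference, varied at every listed position.
reference = "G" * 110
candidate = "".join(
    "A" if i in set(VARIABLE_POSITIONS) else r
    for i, r in enumerate(reference, start=1)
)
identity = percent_identity(reference, candidate)
```

The list expands to 18 positions, so for the placeholder length assumed here a member varied at every listed position still remains above the 70% identity threshold.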

A related aspect described herein is a method for generating the just-described library. The method includes selecting an amino acid sequence corresponding to any of SEQ ID NOs: 127-129, in which the selected sequence differs from SEQ ID NOs:127-129 in at least one of the above-mentioned positions. The method further includes providing a first and second set of oligonucleotides having overlapping complementary sequences. Oligonucleotides of the first set encode subsequences of the selected amino acid sequence, the subsequences being heterogeneous at the above-mentioned positions. Oligonucleotides of the second set are complementary to nucleotide sequences encoding subsequences of the selected amino acid sequence, the subsequences being heterogeneous at the above-mentioned positions. The two sets of oligonucleotides are combined to form a first mixture and incubated under conditions that allow hybridization of the overlapping complementary sequences. The resulting hybridized sequences are then extended to form a second mixture containing the above-described library.

Various implementations of the invention can include one or more of the following. For example, each nucleic acid in a library can include a vector sequence. Also featured is any nucleic acid isolated from one of the above-described libraries, as well as the chimeric binding polypeptide encoded by it, in pure form.

In one implementation, a population of cells (or individual cells selected from the population of cells) is provided that expresses chimeric binding polypeptides encoded by one of the libraries. Another implementation features a library of purified chimeric binding polypeptides encoded by one of the nucleic acid libraries. Yet another implementation provides a population of filamentous phage displaying the chimeric binding polypeptides encoded by one of the nucleic acid libraries.

In various implementations of methods for generating the above-described nucleic acid libraries by oligonucleotide assembly, one or more of the following can be included. For example, the method can further include, after the second mixture that contains the nucleic acid library is generated, performing a cycle of denaturing the population of nucleic acids followed by a hybridization and an elongation step. Optionally, this cycle can be repeated (e.g., up to 100 times). The nucleic acid libraries can be amplified by a polymerase chain reaction that includes a forward and a reverse primer that hybridize to the 5′ and 3′ end sequences, respectively, of all nucleic acids in the library. In one implementation, amino acids to be encoded in variable sequence positions are selected from a subset (e.g., only 4, 6, 8, 10, 12, 14 or 16) of alanine, arginine, asparagine, aspartate, glutamine, glutamate, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, cysteine and valine (the 20 naturally occurring amino acids). In other cases, 19 of the 20 are used (cysteine is excluded). In other cases, all 20 are used. In another implementation, the subset of amino acids includes at least one aliphatic, one acidic, one neutral, and one aromatic amino acid (e.g., alanine, aspartate, serine, and tyrosine).
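The effect of restricting the amino acid subset on the theoretical size of a library follows from simple counting: with k allowed amino acids at each of n independently varied positions, there are k^n possible combinations. The sketch below illustrates this; the number of variable positions passed in is a hypothetical input, and the function names are assumptions introduced for illustration.

```python
from itertools import product

# Example subset from the text: alanine, aspartate, serine, and tyrosine.
SUBSET = ("A", "D", "S", "Y")


def theoretical_diversity(alphabet_size, n_variable_positions):
    """Number of distinct combinations: alphabet_size ** n_variable_positions."""
    return alphabet_size ** n_variable_positions


def enumerate_variable_residues(alphabet, n_variable_positions):
    """Yield every combination of residues for a small number of positions."""
    for combo in product(alphabet, repeat=n_variable_positions):
        yield "".join(combo)


# Hypothetical design with 18 variable positions.
four_letter_size = theoretical_diversity(len(SUBSET), 18)  # 4 ** 18
twenty_letter_size = theoretical_diversity(20, 18)         # 20 ** 18

# Full enumeration is only practical for very few positions.
small_set = list(enumerate_variable_residues(SUBSET, 3))
```

A smaller subset keeps the theoretical diversity closer to the number of clones that can actually be produced and screened, which is one plausible reason for restricting the alphabet.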

Described herein is a library of nucleic acids encoding at least ten different polypeptides, the amino acid sequence of each polypeptide comprising:

C1-X1-C2-X2-C3-X3-C4, wherein: (i) subsequence C1 is selected from SEQ ID NOs:1-30, subsequence C2 is selected from SEQ ID NOs:31-60, subsequence C3 is selected from SEQ ID NOs:61-90, subsequence C4 is selected from SEQ ID NOs:91-120, and each of C1-C4 comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; (ii) C1-C4 are homogeneous across a plurality of the encoded polypeptides; (iii) each of X1-X3 is an independently variable subsequence consisting of 2-20 amino acids; and each of X1-X3 is heterogeneous across a plurality of the encoded polypeptides.

Also described is a library of nucleic acids encoding at least ten different polypeptides, the amino acid sequence of each polypeptide comprising:

C1-X1-C2-X2-C3-X3-C4, wherein: (i) subsequence C1 is selected from FIG. 2 or FIG. 4, subsequence C2 is selected from FIG. 2 or FIG. 4, subsequence C3 is selected from FIG. 2 or FIG. 4, subsequence C4 is selected from FIG. 2 or FIG. 4, and each of C1-C4 comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; (ii) C1-C4 are homogeneous across a plurality of the encoded polypeptides; (iii) each of X1-X3 is an independently variable subsequence consisting of 2-20 amino acids; and each of X1-X3 is heterogeneous across a plurality of the encoded polypeptides.

Also described is a library of nucleic acids encoding at least ten different polypeptides, the amino acid sequence of each polypeptide comprising:

C1-X1-C2-X2-C3-X3-C4, wherein: (i) subsequence C1 is selected from FIG. 3 or FIG. 5, subsequence C2 is selected from FIG. 3 or FIG. 5, subsequence C3 is selected from FIG. 3 or FIG. 5, subsequence C4 is selected from FIG. 3 XX, and each of C1-C4 comprises up to 30 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; (ii) C1-C4 are homogeneous across a plurality of the encoded polypeptides; (iii) each of X1-X3 is an independently variable subsequence consisting of 2-20 amino acids; and each of X1-X3 is heterogeneous across a plurality of the encoded polypeptides.

In various embodiments: at least 1,000 different polypeptides are encoded; at least 100,000 different polypeptides are encoded; at least 1,000,000 different polypeptides are encoded; each of C1-C4 independently comprises up to 20 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 5 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; none of C1-C4 comprises amino acid substitutions, deletions, insertions, or additions to the selected subsequence; amino acids of X1-X3 are selected from fewer than 20 amino acids genetically encoded in plants; amino acids of X1-X3 are selected from all 20 amino acids genetically encoded in plants; the fewer than 20 genetically encoded amino acids include at least one aliphatic amino acid, at least one acidic amino acid, at least one neutral amino acid, and at least one aromatic amino acid; the fewer than 20 genetically encoded amino acids comprise alanine, aspartate, serine, and tyrosine.

In some cases: the amino acid sequence of each polypeptide is selected from:

(a) a polypeptide comprising C1-X1-C2-X2-C3-X3-C4 wherein C1=SEQ ID NO: 1, C2=SEQ ID NO: 31, C3=SEQ ID NO: 61, and C4=SEQ ID NO: 91;

(b) a polypeptide comprising C1-X1-C2-X2-C3-X3-C4 wherein C1=SEQ ID NO: 2, C2=SEQ ID NO: 32, C3=SEQ ID NO: 62, and C4=SEQ ID NO: 92; and

(c) a polypeptide comprising C1-X1-C2-X2-C3-X3-C4 wherein C1=SEQ ID NO: 3, C2=SEQ ID NO: 33, C3=SEQ ID NO: 63, and C4=SEQ ID NO: 93.

In some cases: each encoded polypeptide comprises C1-X1-C2-X2-C3-X3-C4, wherein C1=SEQ ID NO: X1, C2=SEQ ID NO: X2, C3=SEQ ID NO: X3, and C4=SEQ ID NO: X4; designated SEQ ID NO: 130.

In some embodiments, each of the nucleic acids comprises a vector sequence.

Also described are an isolated nucleic acid selected from the library and an isolated cell expressing the nucleic acid, as well as a purified library of polypeptides encoded by the library, and a population of filamentous phage displaying the polypeptides encoded by the library.

Described herein is a method of generating a library, comprising: (i) providing a parental nucleic acid encoding a parental polypeptide comprising the amino acid sequence: C1-X1-C2-X2-C3-X3-C4, wherein subsequence C1 is selected from SEQ ID NOs:1-30, subsequence C2 is selected from SEQ ID NOs:31-60, subsequence C3 is selected from SEQ ID NOs:61-90; subsequence C4 is selected from SEQ ID NOs:91-120; each of C1-C4 comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; and each of X1-X3 is an independent subsequence consisting of 2-20 amino acids; (ii) replicating the parental nucleic acid under conditions that introduce up to 10 single amino acid substitutions, deletions, insertions, or additions to the X1, X2, or X3 subsequences, whereby a population of randomly varied subsequences encoding X1′, X2′, or X3′ is generated; and (iii) substituting the population of randomly varied subsequences X1′, X2′, or X3′ into a population of parental nucleic acids at the positions corresponding to those that encode X1, X2, or X3.

In various instances: at least one of the X1-X3 subsequences is selected from SEQ ID NOs:121-123; each of C1-C4 independently comprises up to 20 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 5 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; none of C1-C4 comprises amino acid substitutions, deletions, insertions, or additions to the selected subsequence; the replicating generates a heterogeneous population of randomly varied subsequences by introducing up to 5 amino acid substitutions in each of X1, X2, or X3; the method further comprises amplifying the library by introducing it into a biological replication system and proliferating the biological replication system; the biological replication system is a plurality of E. coli cells; the biological replication system is a plurality of bacteriophage; the replicating occurs in vitro; the replicating is performed with a purified mutagenic polymerase; the replicating is performed in the presence of a nucleotide analog; the replicating occurs in vivo; the replicating in vivo occurs in a mutagenic strain of E. coli.

Also described is a method of generating the library, comprising: (i) selecting an amino acid sequence comprising the amino acid sequence C1-X1-C2-X2-C3-X3-C4 to be encoded, wherein: (a) subsequence C1 is selected from SEQ ID NOs:1-30, subsequence C2 is selected from SEQ ID NOs:31-60, subsequence C3 is selected from SEQ ID NOs:61-90, and subsequence C4 is selected from SEQ ID NOs:91-120; (b) each of C1-C4 comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; (c) each of X1, X2, and X3 consists of an amino acid sequence 2-20 amino acids in length; (ii) providing a first plurality and a second plurality of oligonucleotides, wherein: (a) oligonucleotides of the first plurality encode the C1-C4 subsequences and multiple heterogeneous X1-X3 variant subsequences X1′-X3′; (b) oligonucleotides of the second plurality are complementary to nucleotide sequences encoding the C1-C4 subsequences and to nucleotide sequences encoding multiple heterogeneous X1′-X3′ subsequences; and (c) the oligonucleotides of the first and second pluralities have overlapping sequences complementary to one another; (iii) combining the population of oligonucleotides to form a first mixture; (iv) incubating the mixture under conditions effective for hybridizing the overlapping complementary sequences to form a plurality of hybridized complementary sequences; and (v) elongating the plurality of hybridized complementary sequences to form a second mixture containing the library.

In various instances: each of C1-C4 independently comprises up to 20 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises from zero up to 5 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; the method further comprises performing a cycle of steps, the cycle of steps comprising denaturing the library by increasing the temperature of the second mixture to a temperature effective for denaturing double stranded DNA, followed by steps (iv) and (v); the method comprises repeating the cycle of steps up to 100 times; the method further comprises amplifying the library by a polymerase chain reaction consisting essentially of the library, a forward primer, and a reverse primer, wherein the forward and reverse primers can hybridize to the 5′ and 3′ end sequences, respectively, of all nucleic acids in the library; the amino acid to be encoded in each position of the X1, X2, or X3 subsequences is selected from a subset of alanine, arginine, asparagine, aspartate, cysteine, glutamine, glutamate, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine; the amino acid selected for each single amino acid substitution is selected from a group of amino acids consisting of at least one aliphatic, at least one acidic, at least one neutral, and at least one aromatic amino acid; and the group of amino acids consists of alanine, aspartate, serine, and tyrosine.

Also described herein is a method of generating a library, comprising: (i) providing a parental nucleic acid encoding a parental polypeptide comprising the amino acid sequence: C1-X1-C2-X2-C3-X3-C4, wherein subsequence C1 is selected from FIG. 2 or FIG. 4, subsequence C2 is selected from FIG. 2 or FIG. 4, subsequence C3 is selected from FIG. 2 or FIG. 4; subsequence C4 is selected from FIG. 2 or FIG. 4; each of C1-C4 comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; and each of X1-X3 is an independent subsequence consisting of 2-20 amino acids; (ii) replicating the parental nucleic acid under conditions that introduce up to 10 single amino acid substitutions, deletions, insertions, or additions to the X1, X2, or X3 subsequences, whereby a population of randomly varied subsequences encoding X1′, X2′, or X3′ is generated; and (iii) substituting the population of randomly varied subsequences X1′, X2′, or X3′ into a population of parental nucleic acids at the positions corresponding to those that encode X1, X2, or X3.

In various embodiments: at least one of the X1-X3 subsequences is selected from SEQ ID NOs:121-123; each of C1-C4 independently comprises up to 20 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 5 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; none of C1-C4 comprises amino acid substitutions, deletions, insertions, or additions to the selected subsequence; the replicating generates a heterogeneous population of randomly varied subsequences by introducing up to 5 amino acid substitutions in each of X1, X2, or X3; the method further comprises amplifying the library by introducing it into a biological replication system and proliferating the biological replication system; the biological replication system is a plurality of E. coli cells; the biological replication system is a plurality of bacteriophage; the replicating occurs in vitro; the replicating is performed with a purified mutagenic polymerase; the replicating is performed in the presence of a nucleotide analog; the replicating occurs in vivo; and the replicating in vivo occurs in a mutagenic strain of E. coli.

Also described is a method of generating the library, comprising: (i) selecting an amino acid sequence comprising C1-X1-C2-X2-C3-X3-C4 to be encoded, wherein (a) subsequence C1 is selected from FIG. 2 or FIG. 4, subsequence C2 is selected from FIG. 2 or FIG. 4, subsequence C3 is selected from FIG. 2 or FIG. 4, and subsequence C4 is selected from FIG. 2 or FIG. 4; (b) each of C1-C4 comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; (c) each of X1, X2, and X3 consists of an amino acid sequence 2-20 amino acids in length; (ii) providing a first plurality and a second plurality of oligonucleotides, wherein (a) oligonucleotides of the first plurality encode the C1-C4 subsequences and multiple heterogeneous X1-X3 variant subsequences X1′-X3′; (b) oligonucleotides of the second plurality are complementary to nucleotide sequences encoding the C1-C4 subsequences and to nucleotide sequences encoding multiple heterogeneous X1′-X3′ subsequences; and (c) the oligonucleotides of the first and second pluralities have overlapping sequences complementary to one another; (iii) combining the population of oligonucleotides to form a first mixture; (iv) incubating the mixture under conditions effective for hybridizing the overlapping complementary sequences to form a plurality of hybridized complementary sequences; and (v) elongating the plurality of hybridized complementary sequences to form a second mixture containing the library.

In various cases: each of C1-C4 independently comprises up to 20 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises from zero up to 5 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; the method further comprises performing a cycle of steps, the cycle of steps comprising denaturing the library by increasing the temperature of the second mixture to a temperature effective for denaturing double stranded DNA, followed by steps (iv) and (v); the method further comprises repeating the cycle of steps up to 100 times; the method further comprises amplifying the library by a polymerase chain reaction consisting essentially of the library, a forward primer, and a reverse primer, wherein the forward and reverse primers can hybridize to the 5′ and 3′ end sequences, respectively, of all nucleic acids in the library; the amino acid to be encoded in each position of the X1, X2, or X3 subsequences is selected from a subset of alanine, arginine, asparagine, aspartate, cysteine, glutamine, glutamate, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine; the amino acid selected for each single amino acid substitution is selected from a group of amino acids consisting of at least one aliphatic, at least one acidic, at least one neutral, and at least one aromatic amino acid; and the group of amino acids consists of alanine, aspartate, serine, and tyrosine.

Also disclosed is a method of generating the library, comprising: (i) providing a parental nucleic acid encoding a parental polypeptide comprising the amino acid sequence: C1-X1-C2-X2-C3-X3-C4, wherein subsequence C1 is selected from FIG. 3 or FIG. 5, subsequence C2 is selected from FIG. 3 or FIG. 5, subsequence C3 is selected from FIG. 3 or FIG. 5; subsequence C4 is selected from FIG. 3 or FIG. 5; each of C1-C4 comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; and each of X1-X3 is an independent subsequence consisting of 2-20 amino acids; (ii) replicating the parental nucleic acid under conditions that introduce up to 10 single amino acid substitutions, deletions, insertions, or additions to the X1, X2, or X3 subsequences, whereby a population of randomly varied subsequences encoding X1′, X2′, or X3′ is generated; and (iii) substituting the population of randomly varied subsequences X1′, X2′, or X3′ into a population of parental nucleic acids at the positions corresponding to those that encode X1, X2, or X3.

In various instances: at least one of the X1-X3 subsequences is selected from SEQ ID NOs:121-123; each of C1-C4 independently comprises up to 20 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 5 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; none of C1-C4 comprises amino acid substitutions, deletions, insertions, or additions to the selected subsequence; the replicating generates a heterogeneous population of randomly varied subsequences by introducing up to 5 amino acid substitutions in each of X1, X2, or X3; the method further comprises amplifying the library by introducing it into a biological replication system and proliferating the biological replication system; the biological replication system is a plurality of E. coli cells; the biological replication system is a plurality of bacteriophage; the replicating occurs in vitro; the replicating is performed with a purified mutagenic polymerase; the replicating is performed in the presence of a nucleotide analog; the replicating occurs in vivo; and the replicating in vivo occurs in a mutagenic strain of E. coli.

Also described is a method of generating the library, comprising: (i) selecting an amino acid sequence comprising C1-X1-C2-X2-C3-X3-C4 to be encoded, wherein (a) subsequence C1 is selected from FIG. 3 or FIG. 5, subsequence C2 is selected from FIG. 3 or FIG. 5, subsequence C3 is selected from FIG. 3 or FIG. 5, and subsequence C4 is selected from FIG. 3 or FIG. 5; (b) each of C1-C4 comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; (c) each of X1, X2, and X3 consists of an amino acid sequence 2-20 amino acids in length; (ii) providing a first plurality and a second plurality of oligonucleotides, wherein (a) oligonucleotides of the first plurality encode the C1-C4 subsequences and multiple heterogeneous X1-X3 variant subsequences X1′-X3′; (b) oligonucleotides of the second plurality are complementary to nucleotide sequences encoding the C1-C4 subsequences and to nucleotide sequences encoding multiple heterogeneous X1′-X3′ subsequences; and (c) the oligonucleotides of the first and second pluralities have overlapping sequences complementary to one another; (iii) combining the population of oligonucleotides to form a first mixture; (iv) incubating the mixture under conditions effective for hybridizing the overlapping complementary sequences to form a plurality of hybridized complementary sequences; and (v) elongating the plurality of hybridized complementary sequences to form a second mixture containing the library.

In various embodiments: each of C1-C4 comprises up to 20 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises from zero up to 5 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; the method further comprises performing a cycle of steps, the cycle comprising denaturing the library by increasing the temperature of the second mixture to a temperature effective for denaturing double stranded DNA, followed by steps (iv) and (v); the method further comprises repeating the cycle up to 100 times; the method further comprises amplifying the library by a polymerase chain reaction consisting essentially of the library, a forward primer, and a reverse primer, wherein the forward and reverse primers can hybridize to the 5′ and 3′ end sequences, respectively, of all nucleic acids in the library; the amino acid to be encoded in each position of the X1, X2, or X3 subsequences is selected from a subset of alanine, arginine, asparagine, aspartate, cysteine, glutamine, glutamate, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine; the amino acid selected for each single amino acid substitution is selected from a group of amino acids consisting of at least one aliphatic, one acidic, one neutral, and one aromatic amino acid; and the group of amino acids consists of alanine, aspartate, serine, and tyrosine.

Also described is a library of nucleic acids encoding at least tendifferent polypeptides, wherein: (i) the amino acid sequence of each ofthe encoded polypeptides comprises an amino acid sequence at least 70%identical to any of SEQ ID NOs:127-129; (ii) the amino acid sequence ofeach of the encoded polypeptides includes amino acids that differ fromthose of SEQ ID NOs:127-129 at positions 14, 15, 33, 35-36, 38, 47-48,66, 68-69, 71, 80, 81, 99, 101-102, and 104, and the amino aciddifferences are heterogeneous across a plurality of the encodedpolypeptides; and (iii) the amino acid sequence of each of the encodedpolypeptides outside of the residues corresponding to positions 14, 15,33, 35-36, 38, 47-48, 66, 68-69, 71, 80, 81, 99, 101-102, and 104 of SEQID NOs: 127-129 is homogeneous across a plurality of the encodedpolypeptides.

In various embodiments: the amino acid sequence of the polypeptides has at least 75% identity to any of SEQ ID NOs 127-129; the amino acid sequence of the polypeptides has at least 80% identity to any of SEQ ID NOs 127-129; the amino acid sequence of the polypeptides has at least 85% identity to any of SEQ ID NOs 127-129; and each of the nucleic acids comprises a vector sequence. Also disclosed: an isolated nucleic acid encoding a polypeptide, selected from the library; a purified polypeptide encoded by the nucleic acid; a population of cells expressing the polypeptides encoded by the library; a cell selected from the population of cells; a purified library of polypeptides encoded by the library; and a population of filamentous phage displaying the library of polypeptides encoded by the library.

Also disclosed is a method of generating the library, comprising: (i) selecting an amino acid sequence corresponding to any one of SEQ ID NOs:127-129 to be encoded, wherein the selected sequence differs from those of SEQ ID NOs:127-129 in at least one of variable positions 14, 15, 33, 35-36, 38, 47-48, 66, 68-69, 71, 80, 81, 99, 101-102, and 104; (ii) chemically providing a first and a second plurality of oligonucleotides, wherein (a) oligonucleotides of the first plurality encode amino acid subsequences of the selected amino acid sequence, the subsequences being heterogeneous at the encoded variable positions; (b) oligonucleotides of the second plurality are complementary to nucleotide sequences encoding subsequences of the selected amino acid sequence, the subsequences being heterogeneous at the encoded variable positions; and (c) the first and second pluralities comprise oligonucleotides having overlapping sequences complementary to one another; (iii) combining the population of oligonucleotides to form a first mixture; (iv) incubating the mixture under conditions effective for hybridizing the overlapping complementary sequences to form a plurality of hybridized complementary sequences; and (v) elongating the plurality of hybridized complementary sequences to form a second mixture containing the library.

In various instances: the method further comprises performing a cycle of denaturing the library by increasing the temperature of the second mixture to a temperature effective for denaturing double stranded DNA, followed by steps (iv) and (v); the method further comprises repeating the cycle up to 100 times; the method further comprises amplifying the library by a polymerase chain reaction consisting essentially of the library, a forward primer, and a reverse primer, wherein the forward and reverse primers can hybridize to the 5′ and 3′ end sequences, respectively, of all nucleic acids in the library; the amino acids to be encoded for the variable positions are selected from a subset of alanine, arginine, asparagine, aspartate, cysteine, glutamine, glutamate, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine; the amino acids selected for the variable positions are selected from a group consisting of an aliphatic, an acidic, a neutral, and an aromatic amino acid; and the group of amino acids consists of alanine, aspartate, serine, and tyrosine.

The details of one or more embodiments of the invention are set forth inthe description below. Other features, objects, and advantages of theinvention will be apparent from the description and drawings, and fromthe claims.

DESCRIPTION OF DRAWINGS

FIG. 1 is a schematic representation depicting the generation of alibrary of nucleic acids encoding chimeric binding polypeptides bydiversifying subsequences within an encoded polypeptide scaffoldsequence. The encoded scaffold polypeptide sequence is designated as SEQID NO:124. The encoded chimeric binding polypeptides included in thelibrary are SEQ ID NOs:844, 845, and 846, respectively (i.e., from topto bottom).

FIG. 2 is an alignment of the sequences of a number of proteins that have regions which can be used as a scaffold. These proteins are homologous to oryzacystatin. The C1, C2, C3 and C4 are boxed and labeled. The sequences shown are SEQ ID NO:132 (i.e., the conserved sequence among the nine homologous sequences from Q2V816_CUCMA_(—)1441/1-28 to Q2V814_CUCMO_(—)734/1-28); SEQ ID NO:133 (i.e., Q2V8H9_LAGLE_(—)431/1-28); SEQ ID NO:134 (i.e., the conserved sequence between the two homologous sequences of Q6DKU9_CUCMA_(—)1441/1-28 and Q6DLC8_CUCMA_(—)1441/1-28); SEQ ID NO:135 (i.e., 080389_CUCSA_(—)795/1-89); SEQ ID NOs:136-150 (i.e., QIRVW3_MEDTR_(—)2578/1-54 to Q8GZV2_CHEMJ_(—)340/1-38); SEQ ID NO:130 (i.e., Reference/1-102); and SEQ ID NOs:151-198 and 200-330 (i.e., CYT1_ORYSA_(—)1097/1-88 to end).

FIG. 3 is an alignment of the sequences of a number of proteins thathave regions which can be used as a scaffold. These proteins arehomologous to C2. The C1, C2, C3 and C4 are boxed and labeled. Sheets1-3 show SEQ ID NOs:331-367 (i.e., Q9M366_ARATH_(—)43120/1-78 toQ9FJG3_ARATH_(—)325405/1-81); SEQ ID NO:131 (i.e., Reference/1-156); andSEQ ID NOs:368-384 (i.e., ERG1_ORYSA_(—)795/1-89 toQ4JHI8_CUCMA_(—)692/1-87). Sheets 4-6, 7-9, 10-12, 13-15, 16-18, 19-21,22-24, and 25-27 show SEQ ID NOs:385-827.

FIG. 4 is an alignment of the sequences of a number of proteins thathave regions which can be used as a scaffold. The sequences shown areSEQ ID NO:130 (i.e., oryza full) and SEQ ID NOs:828-838. These proteinsare homologous to oryzacystatin. The C1, C2, C3 and C4 are boxed andlabeled.

FIG. 5 is an alignment of the sequences of a number of proteins thathave regions which can be used as a scaffold. The sequences shown are,from top to bottom, SEQ ID NO:131 and SEQ ID NOs:839-843. These proteinsare homologous to C2. The C1, C2, C3 and C4 are boxed and labeled.

DETAILED DESCRIPTION

Diverse libraries of nucleic acids (e.g., cDNA libraries) encoding plant chimeric binding polypeptides, as well as methods for generating them, are described below. The amino acid sequences of the library of encoded plant chimeric binding proteins are derived from a scaffold polypeptide sequence that includes subsequences to be varied. The varied subsequences correspond to putative binding domains of the plant chimeric binding proteins, and are highly heterogeneous in the library of plant chimeric binding proteins. In contrast, the sequence of the encoded chimeric binding proteins outside of the varied subsequences is essentially the same as the parent scaffold polypeptide sequence and highly homogeneous throughout the library of encoded plant chimeric binding proteins. Thus, libraries of plant chimeric binding proteins can serve as a universal molecular recognition library platform for selection of specialized binding proteins for expression in transgenic plants. Libraries of plant chimeric binding proteins can be expressed by transfected cells (i.e., as expression libraries) and tested for interaction with a molecular target of interest. For example, expression libraries can be screened to identify polypeptides that bind with high specificity and affinity to polypeptides expressed by plant pests, including nematodes. Ultimately, individual chimeric binding proteins with desired target binding properties can be expressed in a transgenic plant.

I. Plant Scaffold Polypeptide Sequences

A plant scaffold polypeptide sequence is an amino acid sequence based ona plant protein that is structurally tolerant of extreme sequencevariation within one or more regions. The regions to be varied withinthe scaffold polypeptide sequence are conceptually analogous to thehypervariable regions of immunoglobulins, and form putative bindingdomains in a chimeric binding polypeptide. Thus, a large library ofnucleic acid sequences encoding diverse plant chimeric bindingpolypeptides is produced by diversifying specific sequences within ascaffold polypeptide sequence, as is described in detail below.

Plant scaffold polypeptide sequences are selected to have a number ofproperties, e.g., they: (i) are derived from sequences that are of plantorigin; (ii) encode proteins that tolerate the introduction of sequencediversity structurally; (iii) only contain disulfide bonds that do notinterfere with folding of the polypeptide when expressed in a plant;(iv) express at high levels in diverse plant tissues; and (v) can betargeted to different subcellular locations (e.g., cytoplasm,mitochondria, plastid) or secreted from the cell. Based on theseproperties, plant scaffold polypeptide sequences permit the generationof large libraries of chimeric binding polypeptides with highly diversebinding activities. Libraries of chimeric binding polypeptides can bescreened for binding to a target molecule. Chimeric binding proteinshaving the desired binding activity can subsequently be expressed inplants to confer input traits (e.g., pest or pathogen resistance,drought tolerance) or output traits (e.g. modified lipid composition,heavy metal binding for phytoremediation, medicinal uses). Such bindingproteins can also be used in various affinity-based applications, e.g.,diagnostic detection of an antigen using a sandwich ELISA; histochemicaldetection of antigens; generation of protein biochips; and affinitypurification of antigens.

It is helpful to select the scaffold polypeptide sequence based on thesequence of a plant protein or protein domain of known three dimensionalstructure (see, e.g., Nygren et al. (2004) “Binding Proteins fromAlternative Scaffolds,” J. of Immun. Methods 290:3-28). However, evenwithout experimentally determined structural data for a potentialscaffold polypeptide sequence, valuable inferences can be gleaned fromcomputational structural analysis of a candidate amino acid sequence.Useful programs for structure prediction from an amino acid sequenceinclude, e.g., the “SCRATCH Protein Predictor” suite of programsavailable to the public on the world wide web atics.uci.edu/˜baldig/scratch/index. It is important that introduction ofsequence variation not destabilize the known or predicted secondarystructure of the scaffold polypeptide sequence. Accordingly, the knownor predicted secondary structure of the scaffold polypeptide sequenceinforms the selection of amino acid subsequences that can be variedwithin a scaffold polypeptide sequence to form putative binding domains.The structural adequacy of a particular scaffold polypeptide sequencecan be readily tested, e.g., by phage display expression analysismethods that are commonly known in the art. For example, a scaffoldpolypeptide sequence containing 0, 1, 2, 3, or more disulfide bonds canbe tested for its ability to fold into a stable protein. Since proteinsthat do not fold properly will not be incorporated into a phage coat,they will not be displayed. Thus, without undue effort, many candidatescaffold polypeptide sequences can be rapidly screened for their abilityto fold into stable proteins once expressed.

The plant scaffold polypeptide sequences can be based on the accessorydomain from purple acid phosphatases (PAPs). The crystal structure ofthe PAP accessory domain of kidney bean, Phaseolus vulgaris, has beendetermined (Strater et al. (1995), Science 268(5216):1489-1492). Threeexposed loops within the protein are reminiscent of the hypervariabledomains found in immunoglobulins. The loops are brought together by therigid anti-parallel β-sheet framework of the protein. The subsequencesthat form each loop form the putative binding domains of a chimericbinding protein derived from a PAP. These subsequences are diversifiedby substituting, deleting, inserting, or adding up to 10 (e.g., up to 3,4, 6, 8) amino acids. The loops that form the putative binding domainsare particularly well suited to binding target molecules containingpockets or clefts.

PAP-based scaffold polypeptide sequences take the general form:C₁-X₁-C₂-X₂-C₃-X₃-C₄where C₁, C₂, C₃, and C₄ correspond to “backbone” subsequences which caninclude some introduced variation, but are not highly diversified. Onthe other hand, X₁, X₂, and X₃ correspond to highly varied subsequencesthat form the putative binding domains of each PAP-based chimericbinding protein. Table 1 shows a list of suitable C₁-C₄ backbonesubsequences derived from the amino acid sequences of 30 PAPs.

C₁, C₂, C₃, and C₄ correspond to SEQ ID NOs: 1-30, 31-60, 61-90, and91-120, respectively, in Table 1.

X₁, X₂, and X₃ can be based on naturally occurring variants of corresponding PAP sequences, e.g., those shown in Table 2 as SEQ ID NOs: 121-123. Table 2 shows the range of variation at each amino acid position in subsequences corresponding, respectively, to X₁, X₂, and X₃, within 30 naturally occurring PAP sequences. Alternatively, the parent variable subsequences, X₁-X₃, can be arbitrary sequences 2-20 amino acids in length.

In some implementations, C₁, C₂, C₃, and C₄ of a scaffold polypeptide sequence can be selected from multiple PAP-based scaffold polypeptide sequences listed in Table 1, in any combination, e.g., C_(1(SEQ ID NO:5)), C_(2(SEQ ID NO:12)); C_(3(SEQ ID NO:7)), and C_(4(SEQ ID NO:19)); C_(1(SEQ ID NO:5)), C_(2(SEQ ID NO:12)), C_(3(SEQ ID NO:5)), and C_(4(SEQ ID NO:12)); C_(4(SEQ ID NO:22)); C_(1(SEQ ID NO:17)), C_(2(SEQ ID NO:17)), C_(3(SEQ ID NO:19)), and C_(4(SEQ ID NO:1)), and so forth.

TABLE 1 SPSs Based on the Accessory Domain of PAPs

Seq ID  C₁
  1     PQQVHITQGDHVGKAVIVSWVT
  2     PQQVHITQGDLVGKAVIVSWVT
  3     PQQVHITQGDLVGRAMIISWVT
  4     PQQVHITQGDLVGKAVIVSWVT
  5     PQQVHITQGDHVGKAVIVSWVT
  6     PQQVHITQGDHVGKAMIVSWVT
  7     PQQVHITQGDHVGKAMIVSWVT
  8     PQQVHITQGDHEGKTVIVSWVT
  9     PQQVHITQGDLVGQAMIISWVT
 10     PQQVHITQGDLVGQAMIISWVT
 11     PQQVHITQGDHVGKAMIVSWVT
 12     PQQVHITQGDLEGEAMIISWVR
 13     PQQVHITQGDHVGKAVIVSWVT
 14     PQQVHITQGDHVGQAMIISWVT
 15     PQQVYITQGDHEGKGVIASWTT
 16     PQQVHITQGDYEGKGVIISWVT
 17     PQQVHITQGDLVGRAMIISWVT
 18     PQQVHLTQGDHVGKGVIVSWVT
 19     PQQVHITQGDVEGKAVIVSWVT
 20     PQQVHVTQGNHEGNGVIISWVT
 21     PQQVHVTQGNHEGNGVIISWVT
 22     PQQVHITQGDYDGKAVIVSWVT
 23     PQQVHITQGDHEGRSIIVSWIT
 24     PQQVHITLGDQTGTAMTVSWVT
 25     PQQVHITQGDYDGKAVIVSWVT
 26     PQQVHITQGDYDGKAVIISWVT
 27     PQQVHITQGDYDGEAVIISWVT
 28     PQQVHITQGDYDGKAVIISWVT
 29     PQQVHITQGDYDGKAVIISWVT
 30     PQQVHITQGDYNGKAVIVSWVT

Seq ID  C₂
 31     VVVYWSENSKYKKSAEGTVTT
 32     EVHYWSENSDKKKIAEGKLVT
 33     AVRYWSEKNGRKRIAKGKMST
 34     EVHYWSENSDKKKIAEGKLVT
 35     AVRYWSKNSKQKRLAKGKIVT
 36     KVVYWSENSQHKKVAKGNIRT
 37     KVVYWSENSQHKKVARGNIRT
 38     TVLYWSEKSKQKNTAKGKVTT
 39     QVIYWSDSSLQNFTAEGEVFT
 40     QVIYWSDSSLQNFTAEGEVFT
 41     TVLYWSNNSKQKNKATGAVTT
 42     KVLYWIDGSNQKHSANGKITK
 43     TVVYWSEKSKLKNKANGKVTT
 44     EVIYWSNSSLQNFTAEGEVFT
 45     SVLYWAENSNVKSSAEGFVVS
 46     TVVYWAENSSVKRRADGVVVT
 47     AVRYWSEKNGRKRIAKGKMST
 48     KVLYWEFNSKIKQIAKGTVST
 49     KVIYWKENSTKKHKAHGKTNT
 50     TVRYWCENKKSRKQAEATVNT
 51     TVQYWCENEKSRKQAEATVNT
 52     KVQFGTSENKFQTSAEGTVSN
 53     TVFYGTSENKLDQHAEGTVTM
 54     TVRYGSSPEKLDRAAEGSHTR
 55     EVVYGTSPNSYDHSAQGKTTN
 56     HIQYGTSENKFQTSEEGTVTN
 57     EVRYGLSEGKYDVTVEGTLNN
 58     QVHYGAVQGKYEFVAQGTYHN
 59     QVHYGAVQGKYEFVAQGTYHN
 60     EVLYGKNEHQYDQRVEGTVTN

Seq ID  C₃
 61     YIHHCYIKGLEYDTKYYYV
 62     FIHHTTIRNLEYKTKYYYE
 63     FIHHTTIRKLKYNTKYYYE
 64     FIHHTTIRNLEYKTKYYYE
 65     FIHHTTIRNLEYNTKYYYE
 66     YIHHCTIRNLEYNTKYYYE
 67     YIHHCTIRNLEYNTKYYYE
 68     YIHHSTIRHLEFNTKYYYK
 69     FIHHTTITNLEFDTTYYYE
 70     FIHHTTITNLEFDTTYYYE
 71     YIHHCIIKHLKFNTKYYYE
 72     FIHHCTIRRLKHNTKYHYE
 73     YIHHCNIKNLKFDTKYYYK
 74     FIHHTNITNLEFNTTYFYV
 75     YIHHCTIKDLEFDTKYYYE
 76     YIHHCTIKDLEYDTKYYYE
 77     YIHHCTIKNLEYNTKYFYE
 78     YIHHCTIQNLKYNTKYYYM
 79     FIHHCPIRNLEYDTKYYYV
 80     YIHHCLIDDLEFDTKYYYE
 81     YIHHCLIDDLEFDTKYYYE
 82     YVHHCLIEGLEYKTKYYYR
 83     YIHHCVLTDLKYDRKYFYK
 84     FIHHCTLTGLTHATKYYYA
 85     YIHHCLLDKLEYDTKYYYK
 86     YIHHCLIEGLEYETKYYYR
 87     YIHQCLVTGLQYDTKYYYE
 88     FIHHCLVSDLEHDTKYYYK
 89     FIHHCLVSDLEHDTKYYYK
 90     YIHHCLVDGLEYNTKYYYK

Seq ID  C₄
 91     SREFWFR
 92     TRQFWFV
 93     TRRFSFI
 94     TRQFWFV
 95     TRQFWFV
 96     TRSFWFT
 97     TRSFWFT
 98     ARTFWFV
 99     TRQFWFI
100     TRQFWFI
101     PRTFWFV
102     VRSFWFM
103     ARTFWFT
104     TRQFWFI
105     TRKFWFV
106     KRQFWFV
107     TRQFWFT
108     RRTFWFV
109     ERKFWFF
110     SRRFWFF
111     SRRFWFF
112     SREFWFE
113     ARLFWFK
114     VRTFSFT
115     AREFWFH
116     SREFWFK
117     ARKFWFE
118     SREFWFV
119     SREFWFV
120     AREFWFE

TABLE 2 Naturally Occurring Residue Variation in PAP Subsequences Corresponding to X₁, X₂, and X₃ (SEQ ID NOs: 121-123) X₁ X₂X₃ (SEQ ID NO: 121) (SEQ ID NO: 122) (SEQ ID NO: 123) Position PositionPosition a b c d e f g a b c d e f g h i a b c d e f M D E P G S S Y K YY N Y T S G V G L R N T V E A K P N R F F T S P I E I G H E N K L K K TH K N L V E D P V D T F D K M E D Q Q S H E E T K T I T S S A A E E F FK

After diversification of the above-listed subsequences of the scaffold polypeptide sequence, the diversified X₁′, X₂′, and X₃′ subsequences are highly heterogeneous within the library of encoded plant chimeric binding polypeptides, and can each contain up to 10 (e.g., 8, 6, 4, 3) single amino acid substitutions, deletions, insertions, or additions with respect to SEQ ID NOs: 121-123 listed in Table 2, respectively (see, e.g., FIG. 1). For example, the length of the amino acid sequences corresponding to regions X₁, X₂, or X₃ can be unaltered, shortened, or lengthened relative to SEQ ID NOs: 121-123.

The regions outside of the putative binding domains are referred to as “backbone” regions (i.e., C₁, C₂, C₃, and C₄). Unlike the amino acid sequences for X₁, X₂, and X₃, the amino acid sequences of the backbone regions are generally not substantially diversified within the library of encoded chimeric binding proteins, although some sequence variation in these regions within the library is permissible. The backbone regions of a plant scaffold polypeptide sequence can be at least 70% (e.g., 80, 85, 90, 95, 98, or 100%) identical to any of SEQ ID NOs: 1-120. Alternatively, the backbone regions can contain up to 30 (e.g., 28, 26, 24, 22, 20, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1) single amino acid substitutions, deletions, insertions or additions. For example, C₁, C₂, C₃, and C₄ can each include 0, 1, 2, 3, 4, or 5 or more single amino acid changes. If amino acid substitutions are to be introduced into the backbone regions, it is preferable to make conservative substitutions. A conservative substitution is one that replaces an amino acid with one that has similar chemical properties (e.g., substitution of a polar amino acid such as serine with another polar amino acid such as threonine).

In one embodiment, the plant scaffold polypeptide sequence is one of SEQID NOs: 124-126 shown below. Sequences corresponding to X₁, X₂, and X₃are in bold and underlined.

SEQ ID NO: 124
PQQVHITQGDHVGKAVIVSWVT MDEPGSS VVVYWSENSKYKKSAEGTVTT YRFYNYTSG YIHHCYIKGLEYDTKYYYV VGIGNT SREFWFR

SEQ ID NO: 125
PQQVHITQGDLVGKAVIVSWVT VDEPGSS EVHYWSENSDKKKIAEGKLVT YRFFNYSSG FIHHTTIRNLEYKTKYYYE VGLGNT TRQFWFV

SEQ ID NO: 126
PQQVHITQGDLVGRAMIISWVT MDEPGSS AVRYWSEKNGRKRIAKGKMST YRFFNYSSG FIHHTTIRKLKYNTKYYYE VGLRNT TRRFSFI

In other embodiments, a plant scaffold polypeptide sequence is based on the amino acid sequence of plant proteins that have ankyrin-like repeats. Ankyrin-like repeats are small turn-helix-helix (THH) repeats consisting of approximately 33 amino acids. The number of THH repeats within a scaffold polypeptide sequence can vary from 2 to 20. The putative binding sites within the THH repeats are typically non-contiguous, but clustered on the same side of the protein of which they are a part.

A plant THH repeat-containing scaffold polypeptide sequence can have anamino acid sequence that is based on any of SEQ ID NOs: 127-129 listedbelow. High levels of amino acid sequence variation are introduced atthe bolded/underlined residues. The plant THH repeat-containing scaffoldpolypeptide sequences can contain substitutions of up to 3 amino acidsor a deletion in the place of the amino acids corresponding to residues12-13, 33, 35-36, 38, 46-47, 66, 68-69, 71, 79-80, 99, 101-102, 104, and112-113 (residues in bold and underlined) of SEQ ID NOs: 127-129.

SEQ ID NO: 127
GDDLGKKLHLAA SR GHLEIVRVLVEAGADVNA L D KF G R TALHIAA SRGHL EVVKLLLEAGADVNA L D KF G R TALHLAA SR GHLEVVKLLLEAGADVNA L D KF G DTALHVSI DN GNEDIAEILQ

SEQ ID NO: 128
GDDLGKKLHLAA SR GHLEIVRVLVEAGADVNAL D KF G R TPLHIAA SK GNE QVVKLLLEAGADPNA L D KF G R TPLHIAA SKGNEQVVKLLLEAGADPNA Q D KF G D TALHVSI DN GNEDIAEILQ

SEQ ID NO: 129
GSDLGKKLLEAA RA GQDDEVRILMANGADVNA L D KF G R TPLHIAA SK GNEQVVKLLLEAGADPNA L D KF G R TPLHIAA SK GNEQVVKLLLEAGADPNA Q D KF G KTAFDISI DN GNED L AEILQ

The sequence of the scaffold polypeptide sequences can be at least 70%(i.e., 80, 85, 90, 95, 98, or 100%) identical to the sequence outside ofthe foregoing amino acid positions (in bold) of SEQ ID NOS: 127-129.Alternatively, the sequence of the scaffold polypeptide sequencesoutside of the foregoing amino acid positions (in bold) of SEQ IDNOS:127-129 can contain up to 30 (i.e., 28, 26, 24, 22, 20, 18, 17, 16,15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1) single amino acidsubstitutions, deletions, insertions or additions. In some cases it canbe desirable to include additional repeating units. SEQ ID NOs: 127-129have an amino-terminal cap, two internal repeats and a carboxy-terminalcap. It might be desirable to have 1-6 internal repeats. Theamino-terminal cap sequence is aa 1-33. The first internal repeat is34-66 and the second internal repeat is 67-99. The carboxy-terminal capsequence is aa 100-123. The first or the second internal repeats or bothcan be independently repeated 1, 2, 3, 4, 5 or 6 times.

The putative binding sites are formed by amino acid side chainsprotruding from the rigid secondary structure formed by the scaffoldpolypeptide sequence. These proteins may typically form a larger,flatter binding surface and are particularly useful for binding totargets that do not have deep clefts or pockets.

Another suitable scaffold can be based on oryzacystatin (J Biol Chem 262:16793 (1987); Biochemistry 39:14753 (2000)), a member of the cystatin/Papain Family (Pfam Identifier PF00031) that is identified as a cysteine proteinase inhibitor of rice. The sequence of oryzacystatin is depicted below. A scaffold having the amino acid sequence C1-X1-C2-X2-C3-X3-C4, where each of X1, X2 and X3 is a variable region and C1, C2, C3 and C4 are the backbone regions, can be created based on oryzacystatin.

(SEQ ID NO: 130)
MSSVGGPVLGGVEPVGNENDLHLVDLARFAVTEHNKKANSLLEFEKLVSVKQQVVAGTLYYFTLEVKEGDAKKLYEAKVWEKPWMDFKELQEFKPVDASANA

C1-MSS (aa 1-3 of SEQ ID NO: 130)
X1-VGGP (aa 4-7 of SEQ ID NO: 130)
C2-VLGGVEPVGNENDLHLVDLARFAVTEHNKKANSLLEFEKLVSV (aa 8-50 of SEQ ID NO: 130)
X2-KQQVVAGT (aa 51-58 of SEQ ID NO: 130)
C3-LYYFTLEVKEGDAKKLYEAKVWE (aa 59-81 of SEQ ID NO: 130)
X3-KPWM (aa 82-85 of SEQ ID NO: 130)
C4-DFKELQEFKPVDASANA (aa 86-102 of SEQ ID NO: 130)
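By way of illustration only, the C1-X1-C2-X2-C3-X3-C4 partition of SEQ ID NO: 130 given above can be checked and manipulated programmatically. The following Python sketch uses the sequence and residue coordinates exactly as listed; the function names `split_scaffold` and `replace_loops`, and the replacement loop `ADSY`, are hypothetical and are not part of the disclosure.

```python
# SEQ ID NO: 130 (oryzacystatin), split across two literals for readability.
ORYZACYSTATIN = (
    "MSSVGGPVLGGVEPVGNENDLHLVDLARFAVTEHNKKANSLLEFEKLVSV"
    "KQQVVAGTLYYFTLEVKEGDAKKLYEAKVWEKPWMDFKELQEFKPVDASANA"
)

# (name, first residue, last residue), 1-based and inclusive, as listed above.
SEGMENTS = [
    ("C1", 1, 3),
    ("X1", 4, 7),
    ("C2", 8, 50),
    ("X2", 51, 58),
    ("C3", 59, 81),
    ("X3", 82, 85),
    ("C4", 86, 102),
]


def split_scaffold(sequence, segments):
    """Return {name: subsequence}, checking that the segments tile the sequence."""
    parts = {}
    expected_start = 1
    for name, start, end in segments:
        if start != expected_start:
            raise ValueError(f"segment {name} starts at {start}, expected {expected_start}")
        parts[name] = sequence[start - 1:end]
        expected_start = end + 1
    if expected_start != len(sequence) + 1:
        raise ValueError("segments do not cover the whole sequence")
    return parts


def replace_loops(parts, new_loops):
    """Rebuild a full sequence, substituting any X regions named in new_loops."""
    return "".join(new_loops.get(name, sub) for name, sub in parts.items())


parts = split_scaffold(ORYZACYSTATIN, SEGMENTS)
# Hypothetical variant: X1 replaced by an arbitrary four-residue loop.
variant = replace_loops(parts, {"X1": "ADSY"})
```

The same helper could be applied to any scaffold for which backbone and variable coordinates are known, since it only assumes that the segments are contiguous and cover the whole sequence.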

FIG. 2 depicts the sequences of a large number of plant proteins alignedwith oryzacystatin. Examples of suitable C1-C4 regions are indicated.FIG. 4 depicts the sequences of a small number of plant proteins alignedwith oryzacystatin. Examples of suitable C1-C4 regions are indicated. Ingeneral, X1 can be a sequence of 2-20 random amino acids (e.g., 3 aminoacids). X2 can be a sequence of 2-20 random amino acids (e.g., 4 aminoacids). X3 can be a sequence of 2-20 random amino acids (e.g., 4 aminoacids).

Yet another suitable scaffold can be based on the C2 protein of rice (Biochemistry 42:11625 (2003)), a member of the C2 domain family (Pfam Identifier PF00168) that is thought to be involved in plant defense signaling systems. The sequence of rice C2 is depicted below. A scaffold having the amino acid sequence C1-X1-C2-X2-C3-X3-C4, where each of X1, X2 and X3 is a variable region and C1, C2, C3 and C4 are the backbone regions, can be created based on rice C2.

(SEQ ID NO: 131)
MAGSGVLEVHLVDAKGLTGNDFLGKIDPYVVVQYRSQERKSSVARDQGKNPSWNEVFKFQINSTAATGQHKLFLRLMDHDTFSRDDFLGEATINVTDLISLGMEHGTWEMSESKHRVVLADKTYHGEIRVSLTFTASAKAQDHAEQVGGWAHSFRQ

C1-MAGSGVLEVHLVDAKG (aa 1-16 of SEQ ID NO: 131)
X1-LTGNDFLGKID (aa 17-27 of SEQ ID NO: 131)
C2-PYVVVQYRSQERK (aa 28-40 of SEQ ID NO: 131)
X2-SSVARDQGKNP (aa 41-51 of SEQ ID NO: 131)
C3-SWNEVFKFQINSTAATGQHKLFLRL (aa 52-76 of SEQ ID NO: 131)
X3-MDHDTFSRDDFL (aa 77-88 of SEQ ID NO: 131)
C4-GEATINVTDLISLGMEHGTWEMSESKHRVVLADKTYHGEIRVSLTFTASAKAQDHAEQVGGWAHSFRQ (aa 89-156 of SEQ ID NO: 131)

FIG. 3 depicts the sequences of a large number of plant proteins aligned with rice C2. Examples of suitable C1-C4 regions are indicated. FIG. 5 depicts the sequences of a small number of plant proteins aligned with rice C2. Examples of suitable C1-C4 regions are indicated. In general, X1 can be a sequence of 2-20 random amino acids (e.g., 11 amino acids). X2 can be a sequence of 2-20 random amino acids (e.g., 11 amino acids). X3 can be a sequence of 2-20 random amino acids (e.g., 12 amino acids).

The following sections disclose methods for generating libraries ofnucleic acids encoding chimeric binding proteins based on plant scaffoldpolypeptide sequences.

II. Generation of Nucleic Acid Libraries Based on a Plant ScaffoldPolypeptide Sequence

A large library of nucleic acid sequence variants encoding the plant scaffold polypeptide sequence is created based on one or more plant scaffold polypeptide sequences. The library of nucleic acids encodes at least 5 (e.g., 1,000, 10⁵, 10⁶, 10⁷, 10⁹, 10¹², 10¹⁵ or more) different chimeric binding protein sequences. It is recognized that not every member of a library generated by the methods described herein will encode a unique amino acid sequence. Nevertheless, it is desirable that at least 10% (e.g., 25%, 30%, 40%, 50%, 60%, 70%, 75%, or 90%) of the encoded chimeric binding proteins represented in the library be unique.

Prior to diversifying a plant scaffold polypeptide sequence, it may beuseful to estimate computationally the expected sequence diversity to begenerated with a given set of sequence variation parameters. A methodfor estimating sequence diversity is described, e.g., in Volles et al.(2005), 33 (11): 3667-3677. For example, the number of differentsequences expected in a library of nucleic acids generated by PCR can beestimated based on the mutation frequency of the mutagenic polymeraseused for the amplification. Useful algorithms for estimating sequencediversity in randomized protein-encoding libraries can also be found onthe world wide web, e.g., at guinevere.otago.ac.nz/mlrgd/STATS/index.
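As one simplified illustration of such an estimate, the number of distinct sequences expected in a library can be approximated by assuming that each library member is drawn independently and with equal probability from a fixed pool of possible variants. This equiprobable sampling model is an assumption made here for illustration; it is not asserted to be the method of the cited reference or of the cited web resource. Under that assumption, a library of L members drawn from V possible variants is expected to contain V(1 − (1 − 1/V)^L) distinct variants. The function name `expected_distinct` is hypothetical.

```python
import math


def expected_distinct(num_variants, library_size):
    """Expected number of distinct variants in a library.

    Assumes (for illustration only) that each of the `library_size` members
    is drawn independently and uniformly from `num_variants` possibilities.
    """
    if num_variants < 1 or library_size < 0:
        raise ValueError("num_variants must be >= 1 and library_size >= 0")
    if num_variants == 1:
        return 1.0 if library_size > 0 else 0.0
    # V * (1 - (1 - 1/V)**L), written with log1p/expm1 so that it stays
    # accurate when V is very large (e.g., 4**22 possible loop sequences).
    return -num_variants * math.expm1(library_size * math.log1p(-1.0 / num_variants))


def fraction_unique(num_variants, library_size):
    """Expected distinct variants divided by the number of library members."""
    return expected_distinct(num_variants, library_size) / library_size
```

For example, under this model a library whose size equals the number of possible variants is expected to contain roughly 63% of them (1 − 1/e), which gives a rough sense of how much oversampling a given coverage target would require.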

Libraries of nucleic acids encoding plant chimeric binding proteins canbe generated by a number of known methodologies. Sequence diversity isintroduced into a plant scaffold polypeptide sequence by substitution,deletion, insertion, or addition of amino acids at the highly variablepositions of a scaffold polypeptide sequence as described above. Sincethe set of 20 amino acids that are genetically encoded in plants havesomewhat redundant chemical and structural properties, a subset of aminoacids (e.g., a subset of 4 types of amino acids) that encompasses thisstructural diversity can be adopted for substitutions. For example,amino acids to be used for substitution or insertion can be selected toinclude an acidic amino acid, a neutral amino acid, an aliphatic aminoacid, and an aromatic amino acid (see Table 3). For example, the aminoacids used for substitution could be limited to aspartate, serine,alanine, and tyrosine. Limiting the redundancy of amino acidsubstitutions will increase the overall structural and binding diversityof the library of chimeric binding proteins.
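The effect of a restricted amino acid set on the theoretical size of the sequence space can be illustrated with a short calculation. The Python sketch below is a hypothetical example only: it counts substitution-only variants (no insertions or deletions) and enumerates a short loop over the four-residue set of aspartate, serine, alanine, and tyrosine mentioned above. It says nothing about binding diversity, which is not modeled. The names `library_size` and `enumerate_loops` are assumptions introduced for this illustration.

```python
from itertools import product

# One-letter codes for alanine, aspartate, serine, and tyrosine.
RESTRICTED_SET = "ADSY"


def library_size(alphabet_size, positions):
    """Number of substitution-only variants over `positions` varied residues."""
    return alphabet_size ** positions


def enumerate_loops(length, alphabet=RESTRICTED_SET):
    """List every loop sequence of the given length over the alphabet."""
    return ["".join(residues) for residues in product(alphabet, repeat=length)]


# A three-residue loop: 4**3 = 64 variants with the restricted set,
# versus 20**3 = 8000 with all twenty genetically encoded amino acids.
loops = enumerate_loops(3)
restricted_count = library_size(len(RESTRICTED_SET), 3)
full_count = library_size(20, 3)
```

Because the count grows exponentially with the number of varied positions, full enumeration as in `enumerate_loops` is practical only for short loops; for longer regions only the count from `library_size` would typically be computed.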

TABLE 3 Chemical Properties of Amino Acids Genetically Encoded in Plants

Class      Amino acids
Acidic     Aspartate, Glutamate
Neutral    Asparagine, Cysteine, Glutamine, Methionine, Proline, Serine, Threonine
Aliphatic  Alanine, Glycine, Isoleucine, Leucine, Valine
Aromatic   Histidine, Phenylalanine, Tryptophan, Tyrosine
Basic      Arginine, Lysine

The library of nucleic acids can be generated in vitro by assembly of sets of oligonucleotides with overlapping complementary sequences. First, a scaffold polypeptide sequence is selected that is to be encoded by sets of assembled oligonucleotides. The sequences to be encoded in the variable regions of a given scaffold polypeptide sequence will include a multitude of heterogeneous sequences containing substitutions, insertions, deletions, or additions in accordance with the library of chimeric binding polypeptides to be generated as described above. The scaffold polypeptide sequences to be encoded can include the C₁-C₄ subsequences corresponding to any of SEQ ID NOs:1-30, 31-60, 61-90, and 91-120, respectively.

One set of oligonucleotides encodes regions of the plant scaffoldpolypeptide sequence where diversity is to be introduced (e.g., at X₁,X₂, and X₃). In contrast, regions of the scaffold polypeptide sequencein which little or no variation is to be introduced (e.g., in backbonedomains of PAP scaffold polypeptide sequences) are encoded by a set ofoligonucleotides encoding amino acid sequences with no less than 70%(i.e., 75%, 80%, 85%, 90%, 95%, or 100%) identity to any one of theabove-mentioned scaffold polypeptide sequences. The details of thismethod are described, e.g., in U.S. Pat. No. 6,521,453, herebyincorporated by reference.

Sequence-varied oligonucleotides used to generate libraries of nucleicacids are typically synthesized chemically according to the solid phasephosphoramidite triester method described by Beaucage and Caruthers(1981), Tetrahedron Letts., 22 (20):1859-1862, e.g., using an automatedsynthesizer, as described in Needham-VanDevanter et al. (1984) NucleicAcids Res., 12:6159-6168. A wide variety of equipment is commerciallyavailable for automated oligonucleotide synthesis. Multi-nucleotidesynthesis approaches (e.g., tri-nucleotide synthesis), as discussed,supra, are also useful.

Nucleic acids can be custom ordered from a variety of commercialsources, such as Sigma-Genosys (at sigma-genosys.com/oligo.asp); TheMidland Certified Reagent Company (mcrc@oligos.com), The Great AmericanGene Company (at genco.com), ExpressGen Inc. (at expressgen.com), OperonTechnologies Inc. (Alameda, Calif.) and many others.

The oligonucleotides can have a codon use optimized for expression in a particular cell type (e.g., in a plant cell, a mammalian cell, a yeast cell, or a bacterial cell). Codon usage frequency tables are publicly available, e.g., on the world wide web at kazusa.or.jp/codon. Codon biasing can be used to optimize expression in a cell or on the surface of a cell in which binding of a plant chimeric binding protein is to be assessed, and can also be used to optimize expression of the chimeric binding protein in a transgenic organism of commercial interest (e.g., a transgenic plant). In general, codons with a usage frequency of less than 10% are not used. Before synthesis, oligonucleotide sequences are checked for potentially problematic sequences, e.g., restriction sites useful for subcloning, potential plant splice acceptor or donor sites (see, e.g., cbs.dtu.dk/services/FeatureExtract/), potential mRNA destabilization sequences (e.g., “ATTTA”), and stretches of more than four occurrences of the same nucleotide. Potentially problematic sequences are changed accordingly.
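The pre-synthesis check described above can be sketched in Python. This is an illustrative sketch only: the function name, the choice of Not I and EcoR I as the restriction sites to screen, and the output format are assumptions made for the example, and splice-site prediction is left to an external tool and is not attempted here.

```python
import re

# Illustrative screening list (an assumption for this sketch, not a list from the text).
RESTRICTION_SITES = {"NotI": "GCGGCCGC", "EcoRI": "GAATTC"}
DESTABILIZING_MOTIF = "ATTTA"  # mRNA destabilization sequence named in the text
MAX_RUN = 4                    # runs of more than four identical nucleotides are flagged


def find_problem_sequences(seq):
    """Return (label, start_index, matched_text) for each flagged feature in seq."""
    seq = seq.upper()
    hits = []
    # Lookahead patterns report overlapping occurrences as well.
    for name, site in RESTRICTION_SITES.items():
        for m in re.finditer("(?=" + site + ")", seq):
            hits.append((name, m.start(), site))
    for m in re.finditer("(?=" + DESTABILIZING_MOTIF + ")", seq):
        hits.append(("mRNA destabilization", m.start(), DESTABILIZING_MOTIF))
    # One nucleotide followed by MAX_RUN or more copies of itself.
    for m in re.finditer(r"([ACGT])\1{%d,}" % MAX_RUN, seq):
        hits.append(("homopolymer run", m.start(), m.group(0)))
    return hits


if __name__ == "__main__":
    for hit in find_problem_sequences("ATGGCGGCCGCTTATTTACCAAAAAGG"):
        print(hit)
```

Each reported position would then be removed by choosing a synonymous codon, as the paragraph above describes.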

Populations of oligonucleotides are synthesized that encode amino acid variations in the putative binding regions of the selected scaffold polypeptide sequence (e.g., in regions X₁, X₂, and X₃ of a PAP scaffold polypeptide sequence).

Preferably, all of the oligonucleotides of a selected length (e.g., about 10, 12, 15, 20, 30, 40, 50, 60, 70, 80, 90, or 100 or more nucleotides) that correspond to regions where sequence diversity is to be introduced in the scaffold polypeptide sequence encode all possible amino acid variations from a diverse set of amino acids as described above. This includes N oligonucleotides per N sequence variations, where N is the number of different sequences at a locus. The N oligonucleotides are identical in sequence, except for the nucleotide(s) encoding the variant amino acid(s). In generating the sequence-varied oligonucleotides, it can be advantageous to utilize parallel or pooled synthesis strategies in which a single synthesis reaction or set of reagents is used to make common portions of each oligonucleotide. This can be performed, e.g., by well-known solid-phase nucleic acid synthesis techniques, or, e.g., utilizing array-based oligonucleotide synthetic methods (see, e.g., Fodor et al. (1991) Science, 251: 767-777; Fodor (1997) “Genes, Chips and the Human Genome” FASEB Journal. 11:121-121; Fodor (1997) “Massively Parallel Genomics” Science. 277:393-395; and Chee et al. (1996) “Accessing Genetic Information with High-Density DNA Arrays” Science 274:610-614).

In typical synthesis strategies the oligonucleotides have at least about 10 bases of sequence identity to either side of a region of variance to ensure reasonably efficient recombination. However, flanking regions with identical bases can have fewer identical bases (e.g., 4, 5, 6, 7, 8, or 9) and can, of course, have larger regions of identity (e.g., 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 50, or more).
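As a minimal illustration of the "N oligonucleotides per N sequence variations" design with identical flanking regions, the following Python sketch builds one oligonucleotide per variant codon. The function name, the flank sequences, and the four example codons are hypothetical and chosen only for the example; the `min_flank` default of 10 reflects the typical value mentioned above and can be lowered.

```python
def variant_oligos(left_flank, right_flank, variant_codons, min_flank=10):
    """Build one oligonucleotide per variant codon; all share identical flanks."""
    if len(left_flank) < min_flank or len(right_flank) < min_flank:
        raise ValueError("flanking identity is shorter than min_flank bases")
    return [left_flank + codon + right_flank for codon in variant_codons]


if __name__ == "__main__":
    # Hypothetical 10-base flanks and an illustrative set of codons.
    codons = ["GCT", "GAT", "AAG", "TAC"]  # Ala, Asp, Lys, Tyr
    for oligo in variant_oligos("GCTGGTACCA", "TCTGGAGCTC", codons):
        print(oligo)
```

The resulting N sequences differ only at the variant codon, which is what allows the common portions to be made in a shared or pooled synthesis.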

The oligonucleotides to be assembled together are incubated to allow hybridization between oligonucleotides containing overlapping complementary sequences. Each set of hybridizing overlapping oligonucleotides thereby forms a contiguous nucleic acid interrupted by small gaps. These small gaps can be filled to form full length sequences using any of a variety of polymerase-mediated reassembly methods, e.g., as described herein and as known to one of skill. The greatest sequence diversity is introduced in oligonucleotides encoding the plant scaffold polypeptide sequence putative binding regions and residues. However, oligonucleotides encoding specific sequence variations can be “spiked” in the recombination mixture at any selected concentration, thus causing preferential incorporation of desirable modifications into the encoded plant chimeric binding proteins in regions outside of the putative binding domains.

For example, during oligonucleotide elongation, hybridized oligonucleotides are incubated in the presence of a nucleic acid polymerase, e.g., Taq, Klenow, or the like, and dNTPs (i.e., dATP, dCTP, dGTP and dTTP). If regions of sequence identity are large, Taq or another high-temperature polymerase can be used with a hybridization temperature of between about room temperature (i.e., about 25° C.) and, e.g., about 65° C. If the areas of identity are small, Klenow, Taq, or other polymerases can be used with a hybridization temperature of below room temperature. The polymerase can be added to the assembly reaction prior to, simultaneously with, or after hybridization of the oligonucleotides. Afterwards, the resulting elongated double-stranded nucleic acid sequences are denatured, hybridized, and elongated again. This cycle can be repeated any desired number of times, e.g., from about 2 to about 100 times.
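One hybridization and elongation step can be modeled in Python as follows. This is a simplified sketch under stated assumptions: both oligonucleotides are written 5′ to 3′, pairing is required to be perfect, only the 3′ ends overlap, and the function names and example sequences are hypothetical. It does not model temperature, polymerase choice, or mispriming.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_length(top, bottom, min_overlap=4):
    """Length of the longest 3' end of `top` that base-pairs with the 3' end of `bottom`."""
    bottom_as_top = reverse_complement(bottom)
    for n in range(min(len(top), len(bottom_as_top)), min_overlap - 1, -1):
        if top.endswith(bottom_as_top[:n]):
            return n
    return 0


def extend_pair(top, bottom, min_overlap=4):
    """Top-strand sequence after both strands are filled in, or None if no overlap."""
    n = overlap_length(top, bottom, min_overlap)
    if n == 0:
        return None
    return top + reverse_complement(bottom)[n:]


if __name__ == "__main__":
    top = "ATGGCTGGTACC"     # hypothetical top-strand oligonucleotide
    bottom = "TCCAGAGGTACC"  # hypothetical bottom-strand oligonucleotide
    print(overlap_length(top, bottom), extend_pair(top, bottom))
```

Repeating the denature, hybridize, and elongate cycle corresponds to calling `extend_pair` again on the products with the next overlapping oligonucleotide.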

Optionally, after multiple cycles of combinatorial nucleic acid assembly, the resulting products can be amplified, e.g., by standard polymerase chain reaction (PCR). A portion of the volume of the above-described assembly reaction is incubated with unique forward and reverse primers that hybridize universally to the ends of the nucleic acids, as well as dNTPs and a suitable polymerase (e.g., Pfu polymerase). The PCR reaction is then carried out for about 10 to 40 cycles.

To determine the extent of oligonucleotide incorporation, any approach which distinguishes similar nucleic acids can be used. For example, the nucleic acids can be cloned and sequenced, or amplified (in vitro or by cloning, e.g., into a standard cloning or expression vector) and cleaved with a restriction enzyme which specifically recognizes a particular oligonucleotide sequence variant.

It is useful to include rare restriction sites (e.g., Not I) in the 5′ ends of the 5′ and 3′ most primers used either in the assembly or PCR reactions. Inclusion of restriction sites in these primers facilitates subcloning of the nucleic acids into a vector by restriction digestion and subsequent ligation. Alternatively, the assembly reaction or PCR products can also be subcloned, without being restriction digested, using standard methods, e.g., “TA” cloning.

Other methods for introducing diversity into a plant scaffold polypeptide sequence can also be used. For example, a scaffold polypeptide sequence can be encoded in a nucleic acid template, e.g., a plasmid construct. Alternatively, a PCR product, mRNA or genomic DNA from an appropriate plant species such as soybean may also serve as a template encoding a plant scaffold polypeptide sequence. One or more scaffold polypeptide sequence subsequences to be diversified (e.g., the X₂ region of a PAP scaffold polypeptide sequence) can be diversified during or after amplification from the scaffold polypeptide sequence nucleic acid template by any of a number of error-prone PCR methods. Error-prone PCR methods can be divided into (a) methods that reduce the fidelity of the polymerase by unbalancing nucleotide concentrations and/or adding chemical compounds such as manganese chloride (see, e.g., Lin-Goerke et al. (1997) Biotechniques, 23, 409-412), (b) methods that employ nucleotide analogs (see, e.g., U.S. Pat. No. 6,153,745), (c) methods that utilize ‘mutagenic’ polymerases (see, e.g., Cline, J. and Hogrefe, H. H. (2000) Strategies (Stratagene Newsletter), 13, 157-161), and (d) combined methods (see, e.g., Xu, H., Petersen, E. I., Petersen, S. B. and el-Gewely, M. R. (1999) Biotechniques, 27, 1102-1108). Other PCR-based mutagenesis methods include those described, e.g., by Osuna J, Yanez J, Soberon X, and Gaytan P. (2004), Nucleic Acids Res. 2004, 32 (17):e136 and Wong T S, Tee K L, Hauer B, and Schwaneberg, Nucleic Acids Res. 2004 Feb. 10; 32 (3):e26, and others known in the art.

After generating a population of sequence variants, these can be substituted by subcloning into the appropriate region of a chosen plant scaffold polypeptide sequence nucleic acid (e.g., a plasmid containing a scaffold polypeptide sequence), which thereby effectively acts as a vector for the library of diversified sequences.

Yet another approach to mutagenizing specific plant scaffold polypeptide sequence regions is the use of a mutagenic E. coli strain (see, e.g., Wu et al. (1999), Plant Mol. Biol., 39 (2):381-386). A nucleic acid vector containing a target sequence to be mutated is introduced into the mutator strain, which is then propagated. Error-prone DNA replication in the mutator E. coli strain introduces mutations into the introduced target sequence. The population of altered target sequences is then recovered and subcloned into the appropriate position of a nucleic acid encoding the selected plant scaffold polypeptide sequence to generate a diverse library of nucleic acids encoding plant chimeric binding proteins.

III. Expression And Screening of Plant Chimeric Binding Proteins

The library of nucleic acids based on a plant scaffold polypeptide sequence and encoding plant chimeric binding polypeptides is subcloned into an expression vector and introduced into a biological replication system to generate an expression library. The expression library can be propagated and screened to identify plant chimeric binding proteins that bind a target molecule (TM) of interest (e.g., a nematode, insect, fungal, viral or plant protein).

The biological replication system on which screening of plant chimeric binding proteins will be practiced should be capable of growth in a suitable environment, after selection for binding to a target. Alternatively, the nucleic acid encoding the selected plant chimeric binding protein can be isolated by in vitro amplification. During at least part of the growth of the biological replication system, the increase in number is preferably approximately exponential with respect to time. The frequency of library members that exhibit the desired binding properties may be quite low, for example, one in 10⁶ or less.
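To give a feel for what a binder frequency of one in 10⁶ implies for screening scale, the following Python sketch estimates how many clones must be sampled to include at least one binder with a chosen probability. It assumes, purely for illustration, that clones are sampled independently and that each is a binder with the same frequency; the function name and the 95% default are hypothetical choices, not values from this description.

```python
import math


def clones_needed(binder_frequency, confidence=0.95):
    """Smallest N such that P(at least one binder among N clones) >= confidence."""
    # P(no binder among N clones) = (1 - f)^N, so require (1 - f)^N <= 1 - confidence.
    # log1p keeps precision when the frequency f is very small.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-binder_frequency))


if __name__ == "__main__":
    print(clones_needed(1e-6))  # roughly three million clones at 95% confidence
```

Under these assumptions a rare binder is only reliably present when several million clones are screened, which is why a replication system with approximately exponential growth is preferred.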

Biological replication systems can be bacterial DNA viruses, vegetative bacterial cells, or bacterial spores. Eukaryotic cells (e.g., yeast cells) can also be used as a biological replication system.

In a particularly useful embodiment, a chimeric binding protein-phage coat protein fusion is encoded in a phagemid construct. The phagemid constructs are transformed into host bacteria, which are subsequently infected with a helper phage that expresses wild type coat proteins. The resulting phage progeny have protein coats that include both fusion protein and wild-type coat proteins. This approach has the advantage that phage viability is greater compared to viability of phage that have exclusively chimeric binding protein-coat fusion proteins. Phagemid-based display library construction and screening kits are commercially available, e.g., the EZnet™ Phage Display cDNA Library Construction Kit and Screening Kit (Maxim Biotech, Inc., San Francisco, Calif.).

Nonetheless, a strain of any living cell or virus is potentially useful if the strain can be: 1) genetically altered with reasonable facility to encode a plant chimeric binding protein, 2) maintained and amplified in culture, 3) manipulated to display the potential binding protein domain where it can interact with the target material, and 4) selected while retaining the genetic information encoding the expressed plant chimeric binding protein in recoverable form. Preferably, the biological replication system remains viable after affinity-based selection.

When the biological replication system is a bacterial cell or a phage which is assembled in the periplasm, the expression vector for display of the plant chimeric binding protein encodes the chimeric binding protein itself fused to two additional components. The first component is a secretion signal which directs the initial expression product to the inner membrane of the cell (a host cell when the package is a phage). This secretion signal is cleaved off by a signal peptidase to yield a processed, mature, plant chimeric binding protein. The second component is an outer surface transport signal which directs the biological replication system to assemble the processed protein into its outer surface. This outer surface transport signal can be derived from a surface protein native to the biological replication system (e.g., the M13 phage coat protein gIII).

For example, the expression vector comprises a DNA encoding a plant chimeric binding protein operably linked to a signal sequence (e.g., the signal sequences of the bacterial phoA or bla genes or the signal sequence of M13 phage gene III) and to DNA encoding a coat protein (e.g., the M13 gene III or gene VIII proteins) of a filamentous phage (e.g., M13). The expression product is transported to the inner membrane (lipid bilayer) of the host cell, whereupon the signal peptide is cleaved off to leave a processed hybrid protein. The C-terminus of the coat protein-like component of this hybrid protein is trapped in the lipid bilayer, so that the hybrid protein does not escape into the periplasmic space. As the single-stranded DNA of the nascent phage particle passes into the periplasmic space, it collects both wild-type coat protein and the hybrid protein from the lipid bilayer. The hybrid protein is thus packaged into the surface sheath of the filamentous phage, leaving the plant chimeric binding protein exposed on its outer surface. Thus, the filamentous phage, not the host bacterial cell, is the biological replication system in this embodiment. If a secretion signal is necessary for the display of the plant chimeric binding protein, a “secretion-permissive” bacterial strain can be used for growth of the filamentous phage biological replication system.

It is unnecessary to use an inner membrane secretion signal when the biological replication system is a bacterial spore, or a phage whose coat is assembled intracellularly. In these cases, the display means is merely the outer surface transport signal, typically a derivative of a spore or phage coat protein.

Filamentous phage in general are attractive as biological replication systems for display of plant chimeric binding proteins, and M13 in particular is especially attractive because: 1) the 3D structure of the virion is known; 2) the processing of the coat protein is well understood; 3) the genome is expandable; 4) the genome is small; 5) the sequence of the genome is known; 6) the virion is physically resistant to shear, heat, cold, urea, guanidinium Cl, low pH, and high salt; 7) the phage is a sequencing vector so that sequencing is especially easy; 8) antibiotic-resistance genes have been cloned into the genome; 9) it is easily cultured and stored, with no unusual or expensive media requirements for the infected cells; 10) it has a high burst size, each infected cell yielding 100 to 1000 M13 progeny after infection; and 11) it is easily harvested and concentrated by standard methods.

For example, when the biological replication system is M13, the gene III or the gene VIII proteins can be used as an outer surface targeting signal. Alternatively, the proteins from genes VI, VII, and IX may also be used.

The encoded plant chimeric binding protein can be fused to the surface targeting signal (e.g., the M13 gene III coat protein) at its carboxy or amino terminus. The fusion boundary between the plant chimeric binding protein and the targeting signal can also include a short linker sequence (e.g., up to 20 amino acids long) to avoid undesirable interactions between the chimeric binding protein and the fused targeting signal. In some embodiments it is advantageous to include within the linker sequence a specific proteolytic cleavage site. In addition, the amino terminus or carboxy terminus of the fused protein can include a short epitope tag (e.g., a hemagglutinin tag). Inclusion of a proteolytic cleavage site or a short epitope tag is particularly useful for purification of a library of chimeric binding proteins from a population of cells expressing the library. Epitope-tagged chimeric binding proteins can be conveniently purified by proteolytic cleavage of the linker sequence followed by affinity chromatography utilizing an antibody or other binding agent that recognizes the epitope tag.

Many methods exist for screening phage display libraries (see, e.g., Willats (2002), Plant Mol. Biol., 50:837-854). As commonly practiced, the target molecule of interest is adsorbed to a support and then exposed to solutions of phage displaying plant chimeric binding proteins. The target molecule can be immobilized by passive adsorption on a support medium, e.g., tubes, plates, columns, or magnetic beads. Generally, the adsorptive support medium is pre-blocked, e.g., with bovine serum albumin, milk, or gelatin, to reduce non-specific binding of the phage during screening. Alternatively, the target molecule can be biotinylated, so interaction between chimeric binding protein-bearing phage and the target molecule can be carried out in solution. Phage that bind to the target can then be selected using avidin or streptavidin bound to a solid substrate (e.g., beads or a column).

After phage are allowed to interact with the target molecule, non-interacting phage are removed by washing. The remaining, specifically binding phage are then eluted by one of any number of treatments including, e.g., lowering or increasing pH, application of reducing agents, or use of detergents. In one embodiment, a specific proteolytic cleavage site is introduced between the plant chimeric binding protein sequence and the phage coat protein sequence. Thus, phage elution can be accomplished simply by addition of the appropriate protease.

Eluted phage are then amplified by infection of host cells and can subsequently be re-screened by the method just outlined to reduce the number of false positive binders. During each round of phage screening, care should be taken to include growth of the phage on a solid medium rather than exclusively in a liquid medium, as this minimizes loss of phage clones that grow sub-optimally.

Plant chimeric binding proteins can also be expressed and screened for binding solely in vitro using ribosomal display. An exclusively in vitro approach circumvents the requirement to introduce the library of nucleic acids encoding plant chimeric binding proteins into a biological replication system. Methods for screening polypeptides in vitro by ribosomal protein display are described in detail, e.g., in U.S. Pat. No. 6,589,741. The nucleic acids described in the section above are modified by adding a phage promoter sequence (e.g., a T7 promoter) enabling in vitro transcription, a ribosome binding sequence upstream of the start of translation of the encoded plant chimeric binding protein, and a transcription termination sequence (e.g., from phage T3). The modified library of nucleic acids is then transcribed in vitro to generate a corresponding mRNA population encoding plant chimeric binding proteins. Plant chimeric binding proteins are then expressed in vitro by translating the population of mRNA molecules devoid of stop codons in the correct reading frame in an in vitro translation system, under conditions that allow the formation of polysomes. The polysomes so formed are then brought into contact with a target molecule under conditions that allow the interaction of plant chimeric binding proteins with the target molecule. Polysomes displaying chimeric binding proteins that interact with the target molecule are then separated from non-interacting polysomes displaying no such (poly)peptides, and the mRNA associated with the interacting polysome is then amplified (e.g., by PCR) and sequenced.
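The requirement that the translated mRNA be devoid of stop codons in the correct reading frame can be checked computationally before a library is committed to ribosomal display. The Python sketch below is one simple way to do so; the function name and the example sequences are hypothetical, and the check covers only the three standard stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_in_frame_stop(coding_seq, frame=0):
    """True if a standard stop codon occurs in the given reading frame (0, 1 or 2)."""
    seq = coding_seq.upper().replace("U", "T")  # accept RNA or DNA spelling
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return any(codon in STOP_CODONS for codon in codons)


if __name__ == "__main__":
    print(has_in_frame_stop("ATGGCTTAAGCT"))  # stop codon in frame 0
    print(has_in_frame_stop("ATGGCTAAGGCT"))  # "TAA" present, but out of frame
```

Library members that fail the check would release the nascent polypeptide from the ribosome and so would not remain linked to their mRNA.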

Interaction of a plant chimeric binding protein with a target protein can also be detected in a genetic screen. In the screen, the target protein functions as a “bait protein” and each plant chimeric binding protein functions as a potential “prey” protein in a binding assay that utilizes a two-hybrid assay or three-hybrid assay (see, e.g., U.S. Pat. No. 5,283,317; Zervos et al. (1993) Cell 72:223-232; Madura et al. (1993) J. Biol. Chem. 268:12046-12054; Bartel et al. (1993) Biotechniques 14:920-924; Iwabuchi et al. (1993) Oncogene 8:1693-1696; Hubsman et al. (2001) Nuc. Acids Res. February 15; 29 (4):E18; and Brent WO94/10300).

A two-hybrid assay can be carried out using a target polypeptide as the bait protein. In sum, the target polypeptide is fused to the LexA DNA binding domain and used as bait. The prey is a plant chimeric binding protein library cloned into the active site loop of TrxA as a fusion protein with an N-terminal nuclear localization signal, a LexA activation domain, and an epitope tag (Colas et al. 1996 Nature 380:548; and Gyuris et al. Cell 1993 75:791). Yeast cells are transformed with bait and prey genes. When the target fusion protein binds to a plant chimeric binding protein fusion protein, the LexA activation domain is brought into proximity with the LexA DNA binding domain, and expression of reporter genes or selectable marker genes having an appropriately positioned LexA binding site increases. Suitable reporter genes include those encoding fluorescent proteins (e.g., EGFP) and enzymes (e.g., luciferase, β-galactosidase, alkaline phosphatase, etc.). Suitable selectable marker genes include, for example, the yeast LEU2 gene.

After identification of one or more target-binding chimeric binding proteins, the isolated nucleic acids encoding the chimeric binding proteins can be mutagenized by the methods described herein, to generate small expression libraries expressing variant chimeric binding proteins. The chimeric binding protein-variant expression libraries can be screened to identify chimeric binding protein variants with improved target binding properties (e.g., increased affinity or specificity).

The following specific examples are to be construed as merely illustrative, and not limitative of the remainder of the disclosure in any way whatsoever. Without further elaboration, it is believed that one skilled in the art can, based on the description herein, utilize the present invention to its fullest extent. All publications cited herein are hereby incorporated by reference in their entirety.

EXAMPLES

Example 1. Design And Expression of Plant Scaffold Polypeptide Sequences

Several protein domain families were analyzed for their potential use as scaffolds. A search of PFAM domains (pfam.wustl.edu; see Bateman et al. (2004)), restricting the output to Viridiplantae, was conducted to limit domains only to those present in green plants. Four protein domain families were selected to develop plant universal molecular recognition libraries: the accessory domain of purple acid phosphatase (PAP), plant cystatin, plant C2 domains, and the turn-helix-helix (THH) motif found in ankyrin repeat proteins.

Three purple acid phosphatase scaffolds were designed having the sequence of SEQ ID NOs:34-36. The amino acid sequence of the accessory domain from kidney bean PAP was used as a query sequence to BLAST the NCBI database. When the output was restricted to proteins found in Viridiplantae, 62 unique sequences were identified. From an alignment of these sequences, a consensus plant PAP sequence was generated (SEQ ID NO:34) by selecting the most frequent amino acid at each position in the alignment. The kidney bean (Phaseolus vulgaris) PAP was selected as a parental scaffold (SEQ ID NO:35), because of its known structure. A PAP from soybean, Glycine max, was also chosen (SEQ ID NO:36), as this species represents a common crop species in which transgenic products are generated.
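The consensus step, selecting the most frequent amino acid at each alignment position, can be illustrated with a short Python sketch. The four toy sequences below are invented for the example and are not PAP sequences; the function name is hypothetical, gap characters are counted like any other symbol, and ties are resolved in favor of the residue seen first in the column, which is an arbitrary choice of this sketch.

```python
from collections import Counter


def consensus(aligned_sequences):
    """Return the most frequent symbol in each column of an alignment."""
    length = len(aligned_sequences[0])
    if any(len(s) != length for s in aligned_sequences):
        raise ValueError("aligned sequences must all have the same length")
    result = []
    for column in zip(*aligned_sequences):
        # most_common(1) gives the top symbol; ties go to the first one encountered.
        result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)


if __name__ == "__main__":
    toy_alignment = ["MKTAY", "MRTAY", "MKSAY", "MKTGY"]  # invented sequences
    print(consensus(toy_alignment))
```

Applied to a real alignment, the same column-wise vote yields a consensus of the kind described for SEQ ID NO:34.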

A set of scaffold polypeptide sequences which contain plant ankyrin-like repeats was also designed. Ankyrin-like repeats are small turn-helix-helix (THH) motifs consisting of approximately 33 amino acids. They are common elements of proteins from all organisms and are often found in tandem arrays of 2 to 20 repeats within a protein.

Three THH scaffolds were generated. These proteins are similar in structure to GA binding protein (GABP-β). This protein consists of THH-like amino and carboxy terminal caps with 3 THH internal repeats. In this protein, it is thought that the caps help stabilize the protein by shielding hydrophobic residues found in the internal repeats.

Three hundred and twelve Viridiplantae ankyrin repeat proteins found in PFAM were aligned to aid in designing plant-specific THH scaffolds. A plant consensus THH sequence was generated by selecting the most frequently occurring amino acid at each position. This sequence was termed the plant consensus internal repeat sequence. This sequence was used to search the NCBI databases by BLAST alignment to find the closest natural THH sequence found in plants. A sequence from wheat (Triticum aestivum) was found. The designed repeat based on T. aestivum contains a substitution of valine for the single cysteine occurring in the T. aestivum sequence. Two sets of N and C terminal caps were generated. One set consists of sequences derived from GABP-β, and the second set was derived from the plant THH consensus sequence and optimized to resemble the structure of GABP-β. In particular, the N terminal cap has an extended alpha-helical structure, while the C terminal cap has a truncated helix compared to the typical THH repeat.

Three THH scaffolds were designed. One consists of plant consensus N and C caps and two plant consensus internal THH repeats (SEQ ID NO:37). Another consists of plant consensus N and C caps and two wheat internal repeats (SEQ ID NO:38), and the third consists of ankyrin-like N and C caps with two wheat internal repeats (SEQ ID NO:39).

The genes encoding the plant scaffold polypeptide sequences were designed for expression testing in plants, bacteria, and on the surface of phage. Codons were selected for plant expression using a publicly available Glycine max codon usage table (at kazusa.or.jp/codon, codon usage tabulated from the international DNA sequence databases: status for the year 2000. Nakamura, Y, Gojobori, T and Ikemura, T (2000) Nucl. Acids Res. 28:292.). Codon selection was done manually, with the aim that the final codon frequency roughly reflect the natural frequency for Glycine max. Rarely used codons (<10% frequency) were not used. Final sequences were checked for potentially problematic sequences, including restriction sites needed for cloning, potential plant splice acceptor or donor sites (see website at cbs.dtu.dk/services/NetPgene/), potential mRNA destabilization sequences (ATTTA), and stretches of more than 4 occurrences of the same nucleotide. Any potentially problematic sequences were altered in the genes by modifying codon usage. Since the THH sequences have 4 similar repeat sequences within each protein, steps were taken to reduce nucleotide similarity within repeats; the average repeat identity was reduced 10-15% by these means.
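Although codon selection in this example was done manually, the same rules can be approximated programmatically. The Python sketch below drops codons used at less than 10% and then samples the remaining codons in proportion to their usage. The usage fractions in `TOY_USAGE` are placeholders invented for illustration and are not Glycine max values; the function names and the fixed random seed are likewise assumptions of the sketch.

```python
import random

# PLACEHOLDER fractions for illustration only; NOT real Glycine max codon usage.
TOY_USAGE = {
    "A": {"GCT": 0.40, "GCC": 0.25, "GCA": 0.27, "GCG": 0.08},
    "K": {"AAG": 0.55, "AAA": 0.45},
    "M": {"ATG": 1.00},
}


def allowed_codons(usage, min_fraction=0.10):
    """Drop codons whose usage fraction is below min_fraction."""
    return {aa: {c: f for c, f in codons.items() if f >= min_fraction}
            for aa, codons in usage.items()}


def back_translate(protein, usage, min_fraction=0.10, seed=0):
    """Pick a codon per residue, weighted by usage, from the allowed codons."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    allowed = allowed_codons(usage, min_fraction)
    out = []
    for aa in protein:
        codons = list(allowed[aa])
        weights = [allowed[aa][c] for c in codons]
        out.append(rng.choices(codons, weights=weights, k=1)[0])
    return "".join(out)


if __name__ == "__main__":
    print(back_translate("MAKA", TOY_USAGE))
```

A sequence produced this way would still need the checks for restriction sites, splice sites, destabilization motifs, and nucleotide runs described above.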

Seven constructs were produced using synthetic gene assembly (three based on THH scaffold polypeptide sequences, two based on PAP scaffold polypeptide sequences, one plant cystatin and one plant C2 domain protein). The three THH scaffold polypeptide sequences were placed into a phagemid vector as fusion sequences with the gene III coat protein (gIII) at its carboxy terminus (Phage 3.2, Maxim Biotech, Inc., South San Francisco, Calif.). A 6-His tag was included at the 5′ end of the gene as well as a c-Myc tag between the scaffold gene and the encoded amino terminus of gIII. The phagemid constructs were then packaged into phage particles and the phage were tested for expression and surface display of the THH scaffold. Phage ELISAs using either anti-His or anti-Myc antibodies indicated that the THH scaffold proteins were expressed on the surface of phage, suggesting that all 3 THH scaffold polypeptide sequence constructs fold and express well on the phage surface. The selected scaffold polypeptide sequences were then used to generate expression vectors to evaluate their expression in transgenic plants by immunoblotting.

Tobacco leaves were injected with Agrobacterium strain LB4404 transformed with THH-containing plant expression vectors. Two days later, sections of leaves injected with Agrobacterium were harvested, frozen on dry ice, then ground into a fine powder with a pestle. PBS containing 0.2% Tween-20 was added to the fine powder at a 1:1 weight to volume ratio and additional grinding was done. Insoluble material was removed by centrifugation and 10 µl of the remaining supernatant was loaded onto a 4-12% acrylamide SDS-PAGE gel (NuPage, Invitrogen). Proteins were transferred to PVDF membranes. Proteins were detected using a rat anti-HA antibody (Roche) and an anti-rat HRP conjugated secondary antibody (Chemicon). HRP was detected using Amersham Lumigen reagents.

All three THH scaffolds were found to be expressed, with the relative level of expression of the three scaffolds being TA-THH > CC-THH > TC-THH.

Other Embodiments

All of the features disclosed in this specification may be combined in any combination. Each feature disclosed in this specification may be replaced by an alternative feature serving the same, equivalent, or similar purpose. Thus, unless expressly stated otherwise, each feature disclosed is only an example of a generic series of equivalent or similar features.

From the above description, one skilled in the art can easily ascertain the essential characteristics of the present invention, and without departing from the spirit and scope thereof, can make various changes and modifications of the invention to adapt it to various usages and conditions. Thus, other embodiments are also within the scope of the following claims.

What is claimed is:
 1. A library of cDNA encoding at least ten different polypeptides, the amino acid sequence of each polypeptide comprising: C₁-X₁-C₂-X₂-C₃-X₃-C₄, wherein (i) subsequence C₁ is selected from the C₁ sequences boxed and labeled in FIG. 2 and FIG. 4, subsequence C₂ is selected from the C₂ sequences boxed and labeled in FIG. 2 and FIG. 4, subsequence C₃ is selected from the C₃ sequences boxed and labeled in FIG. 2 and FIG. 4; subsequence C₄ is selected from the C₄ sequences boxed and labeled in FIG. 2 and FIG. 4; (ii) C₁-C₄ are homogeneous across a plurality of the encoded polypeptides; (iii) each of X₁-X₃ is an independently variable subsequence consisting of 2-20 amino acids; and (iv) each of X₁-X₃ are heterogeneous across a plurality of the encoded polypeptides.
 2. The library of claim 1, wherein said subsequences of C₁, C₂, C₃, and C₄ of said plurality of the encoded polypeptides are homologous to subsequences of C₁, C₂, C₃, and C₄ of oryzacystatin, said subsequences of C₁, C₂, C₃ and C₄ of said oryzacystatin having the amino acid sequence as set forth in SEQ ID NO:130 at positions 1-3, 8-50, 59-81, and 86-102, respectively.
 3. A method of generating the library of claim 1, comprising: (i) providing a parental nucleic acid encoding a parental polypeptide comprising the amino acid sequence: C₁-X₁-C₂-X₂-C₃-X₃-C₄, wherein subsequence C₁ is selected from the C₁ sequences boxed and labeled in FIG. 2 and FIG. 4, subsequence C₂ is selected from the C₂ sequences boxed and labeled in FIG. 2 and FIG. 4, subsequence C₃ is selected from the C₃ sequences boxed and labeled in FIG. 2 and FIG. 4; subsequence C₄ is selected from the C₄ sequences boxed and labeled in FIG. 2 and FIG. 4; each of X₁-X₃ is an independent subsequence consisting of 2-20 amino acid positions; (ii) replicating the parental nucleic acid under conditions that introduce up to 10 single amino acid substitutions, deletions, insertions, or additions to the X₁, X₂, or X₃ subsequences, whereby a population of randomly varied subsequences encoding X₁′, X₂′, or X₃′ is generated; and (iii) the population of randomly varied subsequences X₁′, X₂′, or X₃′ is substituted into a population of parental nucleic acids at the positions corresponding to those that encode X₁, X₂, or X₃.
 4. A method of generating the library of claim 1, comprising: (i) selecting an amino acid sequence comprising C₁-X₁-C₂-X₂-C₃-X₃-C₄ to be encoded, wherein (a) subsequence C₁ is selected from the C₁ sequences boxed and labeled in FIG. 2 and FIG. 4, subsequence C₂ is selected from the C₂ sequences boxed and labeled in FIG. 2 and FIG. 4, subsequence C₃ is selected from the C₃ sequences boxed and labeled in FIG. 2 and FIG. 4; subsequence C₄ is selected from the C₄ sequences boxed and labeled in FIG. 2 and FIG. 4; (b) each of X₁, X₂, and X₃ consists of an amino acid sequence 2-20 amino acid positions in length; (ii) providing a first plurality and a second plurality of oligonucleotides, wherein (a) oligonucleotides of the first plurality encode the C₁-C₄ subsequences and multiple heterogeneous X₁-X₃ variant subsequences X₁′-X₃′; (b) oligonucleotides of the second plurality are complementary to nucleotide sequences encoding the C₁-C₄ subsequences and to nucleotide sequences encoding multiple heterogeneous X₁′-X₃′ subsequences; and (c) the oligonucleotides of the first and second pluralities have overlapping sequences complementary to one another; (iii) combining the population of oligonucleotides to form a first mixture; (iv) incubating the mixture under conditions effective for hybridizing the overlapping complementary sequences to form a plurality of hybridized complementary sequences; and (v) elongating the plurality of hybridized complementary sequences to form a second mixture containing the library.
 5. A library of cDNA encoding at least ten different polypeptides, the amino acid sequence of each polypeptide comprising: C₁-X₁-C₂-X₂-C₃-X₃-C₄ wherein (i) subsequence C₁ is selected from the C₁ sequences boxed and labeled in FIG. 3 and FIG. 5, subsequence C₂ is selected from the C₂ sequences boxed and labeled in FIG. 3 and FIG. 5, subsequence C₃ is selected from the C₃ sequences boxed and labeled in FIG. 3 and FIG. 5; subsequence C₄ is selected from the C₄ sequences boxed and labeled in FIG. 3 and FIG. 5; (ii) C₁-C₄ are homogeneous across a plurality of the encoded polypeptides; (iii) each of X₁-X₃ is an independently variable subsequence consisting of 2-20 amino acids; and (iv) each of X₁-X₃ is heterogeneous across a plurality of the encoded polypeptides.
 6. The library of claim 5, wherein said subsequences of C₁, C₂, C₃, and C₄ of said plurality of the encoded polypeptides are homologous to subsequences of C₁, C₂, C₃, and C₄ of C2 protein of rice, said subsequences of C₁, C₂, C₃ and C₄ of said C2 protein of rice having the amino acid sequence as set forth in SEQ ID NO:131 at positions 1-16, 28-40, 52-76, and 89-156, respectively.
 7. A method of generating the library of claim 5, comprising: (i) providing a parental nucleic acid encoding a parental polypeptide comprising the amino acid sequence: C₁-X₁-C₂-X₂-C₃-X₃-C₄, wherein subsequence C₁ is selected from the C₁ sequences boxed and labeled in FIG. 3 and FIG. 5, subsequence C₂ is selected from the C₂ sequences boxed and labeled in FIG. 3 and FIG. 5, subsequence C₃ is selected from the C₃ sequences boxed and labeled in FIG. 3 and FIG. 5; subsequence C₄ is selected from the C₄ sequences boxed and labeled in FIG. 3 and FIG. 5; each of X₁-X₃ is an independent subsequence consisting of 2-20 amino acid positions; (ii) replicating the parental nucleic acid under conditions that introduce up to 10 single amino acid substitutions, deletions, insertions, or additions to the X₁, X₂, or X₃ subsequences, whereby a population of randomly varied subsequences encoding X₁′, X₂′, or X₃′ is generated; and (iii) the population of randomly varied subsequences X₁′, X₂′, or X₃′ is substituted into a population of parental nucleic acids at the positions corresponding to those that encode X₁, X₂, or X₃.
 8. A method of generating the library of claim 5, comprising: (i) selecting an amino acid sequence comprising: C₁-X₁-C₂-X₂-C₃-X₃-C₄ to be encoded, wherein (a) subsequence C₁ is selected from the C₁ sequences boxed and labeled in FIG. 3 and FIG. 5, subsequence C₂ is selected from the C₂ sequences boxed and labeled in FIG. 3 and FIG. 5, subsequence C₃ is selected from the C₃ sequences boxed and labeled in FIG. 3 and FIG. 5; subsequence C₄ is selected from the C₄ sequences boxed and labeled in FIG. 3 and FIG. 5; (b) each of X₁, X₂, and X₃ consists of an amino acid sequence 2-20 amino acid positions in length; (ii) providing a first plurality and a second plurality of oligonucleotides, wherein (a) oligonucleotides of the first plurality encode the C₁-C₄ subsequences and multiple heterogeneous X₁-X₃ variant subsequences X₁′-X₃′; (b) oligonucleotides of the second plurality are complementary to nucleotide sequences encoding the C₁-C₄ subsequences and to nucleotide sequences encoding multiple heterogeneous X₁′-X₃′ subsequences; and (c) the oligonucleotides of the first and second pluralities have overlapping sequences complementary to one another; (iii) combining the population of oligonucleotides to form a first mixture; (iv) incubating the mixture under conditions effective for hybridizing the overlapping complementary sequences to form a plurality of hybridized complementary sequences; and (v) elongating the plurality of hybridized complementary sequences to form a second mixture containing the library.
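
By way of non-limiting illustration only, the C₁-X₁-C₂-X₂-C₃-X₃-C₄ arrangement recited above can be modeled in silico. The following Python sketch works at the amino acid level rather than the nucleic acid level, and every name in it is hypothetical: the four scaffold strings in `SCAFFOLD` are arbitrary placeholders and are not the C₁-C₄ sequences of any figure or SEQ ID NO, and `random_loop`, `make_member`, and `make_library` are illustrative helpers. It shows only that the scaffold subsequences stay homogeneous across members while each of X₁-X₃ varies independently within 2-20 positions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Hypothetical placeholder scaffold (C1..C4); NOT taken from the figures.
SCAFFOLD = ("MKTAY", "GSLVE", "PDNWR", "HQFIC")


def random_loop(rng, min_len=2, max_len=20):
    """Return one variable subsequence of 2-20 random residues."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def make_member(rng, scaffold=SCAFFOLD):
    """Build one C1-X1-C2-X2-C3-X3-C4 sequence with independent loops."""
    c1, c2, c3, c4 = scaffold
    loops = tuple(random_loop(rng) for _ in range(3))
    sequence = c1 + loops[0] + c2 + loops[1] + c3 + loops[2] + c4
    return sequence, loops


def make_library(size, seed=0):
    """Collect `size` distinct members, keyed by full sequence."""
    rng = random.Random(seed)
    members = {}
    while len(members) < size:
        sequence, loops = make_member(rng)
        members[sequence] = loops  # duplicate sequences collapse to one key
    return members


# A toy library of ten different polypeptides.
library = make_library(10)
```

The loops are stored alongside each sequence because, in general, the boundaries between constant and variable subsequences cannot be recovered unambiguously from the concatenated string alone.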